R0023/2026-03-25/Q003
Query: What published evidence exists on prompt degradation over time — prompts that worked with one model version failing or producing different results after model updates?
BLUF: One landmark study provides strong evidence: Chen, Zaharia, and Zou (Stanford/Berkeley, 2023) documented GPT-4 accuracy dropping from 84% to 51% on prime number identification within 3 months between model versions. However, the published evidence base is narrow — essentially one rigorous study, and that study itself shows mixed effects (some tasks degraded, others improved). Separate research on prompt variability suggests some perceived degradation may be normal stochastic noise. The phenomenon is real, but the evidence base is thin and the reality is more complex than prompts simply "stopping working."
Answer: H3 (Complex mixed effects — degradation is real but oversimplified) · Confidence: Medium
Summary
| Entity | Description |
| --- | --- |
| Query Definition | Question as received, clarified, ambiguities, sub-questions |
| Assessment | Full analytical product |
| ACH Matrix | Evidence × hypotheses diagnosticity analysis |
| Self-Audit | ROBIS-adapted 4-domain process audit |
Hypotheses
| ID | Statement | Status |
| --- | --- | --- |
| H1 | Strong evidence documents prompt degradation as significant | Partially supported |
| H2 | Published evidence is sparse/anecdotal | Partially supported |
| H3 | Evidence shows complex mixed effects, not simple degradation | Supported |
Key Degradation Metrics
| Study | Model | Task | Metric | Change | Timeframe |
| --- | --- | --- | --- | --- | --- |
| Chen et al. 2023 | GPT-4 | Prime number ID | Accuracy | 84% → 51% (-33pp) | 3 months |
| Chen et al. 2023 | GPT-4 | Code generation | Formatting errors | Increased | 3 months |
| Chen et al. 2023 | GPT-4 | Multi-hop questions | Accuracy | Improved | 3 months |
| Chen et al. 2023 | GPT-3.5 | Multi-hop questions | Accuracy | Declined | 3 months |
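The percentage-point deltas reported above are simple arithmetic over per-prompt pass/fail results from a fixed evaluation set. A minimal sketch of that computation — the result lists are hypothetical illustrations chosen to match the reported 84% and 51% figures, not Chen et al.'s actual outputs:

```python
# Compare accuracy of two model snapshots on the same fixed prompt set.
# The result lists below are illustrative, not real model outputs.

def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)

# One entry per prompt: True if that snapshot answered correctly.
march_results = [True] * 84 + [False] * 16  # 84% correct (illustrative)
june_results = [True] * 51 + [False] * 49   # 51% correct (illustrative)

delta_pp = (accuracy(june_results) - accuracy(march_results)) * 100
print(f"Change: {delta_pp:+.0f}pp")  # prints "Change: -33pp"
```

The key methodological point the study relies on is holding the prompt set fixed across snapshots, so any accuracy delta is attributable to the model version rather than to prompt changes.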
Searches
| ID | Target | Type | Outcome |
| --- | --- | --- | --- |
| S01 | Published evidence on prompt degradation | WebSearch | 20 returned, 5 selected |
Sources
| Source | Description | Reliability | Relevance | Evidence |
| --- | --- | --- | --- | --- |
| SRC01 | Chen et al. ChatGPT drift (Stanford/Berkeley) | High | High | 2 extracts |
| SRC02 | Wharton GAIL Report 1 (variability) | High | Medium | 1 extract |
| SRC03 | Deepchecks industry analysis | Medium-Low | Medium | 1 extract |
Revisit Triggers
- Publication of systematic cross-version comparison studies for Claude, Gemini, or other model families
- Longitudinal prompt monitoring studies with continuous data collection
- Meta-analysis aggregating cross-version performance data
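The longitudinal monitoring the second trigger describes can be sketched as a small harness: re-score a fixed prompt suite against each model snapshot and flag drift beyond a threshold. Everything here is an illustrative assumption — `call_model` stands in for any model API client, and the 5pp threshold is arbitrary:

```python
# Minimal longitudinal prompt-monitoring sketch: keep a fixed prompt suite,
# score each model snapshot against expected answers, and flag drift.
# `call_model` is a hypothetical stand-in for any model API client.

def score_suite(call_model, suite):
    """Return the fraction of prompts whose response matches the expected answer."""
    correct = sum(1 for prompt, expected in suite
                  if call_model(prompt).strip() == expected)
    return correct / len(suite)

def detect_drift(baseline_acc, current_acc, threshold_pp=5.0):
    """Flag if accuracy moved more than `threshold_pp` percentage points."""
    delta_pp = (current_acc - baseline_acc) * 100
    return abs(delta_pp) > threshold_pp, delta_pp

# Illustrative run with a fake model that always answers "yes":
suite = [("Is 17077 prime? Answer yes or no.", "yes"),
         ("Is 17078 prime? Answer yes or no.", "no")]
fake_model = lambda prompt: "yes"        # stand-in, not a real client
acc = score_suite(fake_model, suite)     # 0.5: one of two prompts correct
drifted, delta = detect_drift(0.84, acc)
print(drifted, f"{delta:+.0f}pp")        # prints "True -34pp"
```

A harness like this cannot distinguish version-driven drift from normal sampling noise on its own; per the Wharton variability findings above, repeated runs per snapshot would be needed before attributing a delta to a model update.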