R0023/2026-03-25/Q003
Query: What published evidence exists on prompt degradation over time — prompts that worked with one model version failing or producing different results after model updates?
BLUF: One landmark study provides strong evidence: Chen, Zaharia, and Zou (Stanford/Berkeley, 2023) documented GPT-4 accuracy dropping from 84% to 51% on prime number identification within 3 months between model versions. However, the published evidence base is narrow — essentially one rigorous study, and that study itself shows mixed effects (some tasks degraded, others improved). Separate research on prompt variability suggests some perceived degradation may be normal stochastic noise. The phenomenon is real, but the evidence base is thin and the reality is more complex than prompts simply "stopping working."
Answer: H3 (Complex mixed effects — degradation is real but oversimplified) · Confidence: Medium
Summary
| Entity | Description |
| --- | --- |
| Query Definition | Question as received, clarified, ambiguities, sub-questions |
| Assessment | Full analytical product |
| ACH Matrix | Evidence × hypotheses diagnosticity analysis |
| Self-Audit | ROBIS-adapted 4-domain process audit |
Hypotheses
| ID | Statement | Status |
| --- | --- | --- |
| H1 | Strong evidence documents prompt degradation as significant | Partially supported |
| H2 | Published evidence is sparse/anecdotal | Partially supported |
| H3 | Evidence shows complex mixed effects, not simple degradation | Supported |
Key Degradation Metrics
| Study | Model | Task | Metric | Change | Timeframe |
| --- | --- | --- | --- | --- | --- |
| Chen et al. 2023 | GPT-4 | Prime number ID | Accuracy | 84% → 51% (-33pp) | 3 months |
| Chen et al. 2023 | GPT-4 | Code generation | Formatting errors | Increased | 3 months |
| Chen et al. 2023 | GPT-4 | Multi-hop questions | Accuracy | Improved | 3 months |
| Chen et al. 2023 | GPT-3.5 | Multi-hop questions | Accuracy | Declined | 3 months |
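The percentage-point deltas reported above are simple arithmetic over per-prompt pass/fail results from a fixed evaluation set. A minimal sketch of that computation — the result lists are hypothetical illustrations chosen to match the reported 84% and 51% figures, not Chen et al.'s actual outputs:

```python
# Compare accuracy of two model snapshots on the same fixed prompt set.
# The result lists below are illustrative, not real model outputs.

def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)

# One entry per prompt: True if that snapshot answered correctly.
march_results = [True] * 84 + [False] * 16  # 84% correct (illustrative)
june_results = [True] * 51 + [False] * 49   # 51% correct (illustrative)

delta_pp = (accuracy(june_results) - accuracy(march_results)) * 100
print(f"Change: {delta_pp:+.0f}pp")  # prints "Change: -33pp"
```

The key methodological point the study relies on is holding the prompt set fixed across snapshots, so any accuracy delta is attributable to the model version rather than to prompt changes.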
Searches
| ID | Target | Type | Outcome |
| --- | --- | --- | --- |
| S01 | Published evidence on prompt degradation | WebSearch | 20 returned, 5 selected |
Sources
| Source | Description | Reliability | Relevance | Evidence |
| --- | --- | --- | --- | --- |
| SRC01 | Chen et al. ChatGPT drift (Stanford/Berkeley) | High | High | 2 extracts |
| SRC02 | Wharton GAIL Report 1 (variability) | High | Medium | 1 extract |
| SRC03 | Deepchecks industry analysis | Medium-Low | Medium | 1 extract |
Revisit Triggers
- Publication of systematic cross-version comparison studies for Claude, Gemini, or other model families
- Longitudinal prompt monitoring studies with continuous data collection
- Meta-analysis aggregating cross-version performance data
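The longitudinal monitoring the second trigger describes can be sketched as a small harness: re-score a fixed prompt suite against each model snapshot and flag drift beyond a threshold. Everything here is an illustrative assumption — `call_model` stands in for any model API client, and the 5pp threshold is arbitrary:

```python
# Minimal longitudinal prompt-monitoring sketch: keep a fixed prompt suite,
# score each model snapshot against expected answers, and flag drift.
# `call_model` is a hypothetical stand-in for any model API client.

def score_suite(call_model, suite):
    """Return the fraction of prompts whose response matches the expected answer."""
    correct = sum(1 for prompt, expected in suite
                  if call_model(prompt).strip() == expected)
    return correct / len(suite)

def detect_drift(baseline_acc, current_acc, threshold_pp=5.0):
    """Flag if accuracy moved more than `threshold_pp` percentage points."""
    delta_pp = (current_acc - baseline_acc) * 100
    return abs(delta_pp) > threshold_pp, delta_pp

# Illustrative run with a fake model that always answers "yes":
suite = [("Is 17077 prime? Answer yes or no.", "yes"),
         ("Is 17078 prime? Answer yes or no.", "no")]
fake_model = lambda prompt: "yes"        # stand-in, not a real client
acc = score_suite(fake_model, suite)     # 0.5: one of two prompts correct
drifted, delta = detect_drift(0.84, acc)
print(drifted, f"{delta:+.0f}pp")        # prints "True -34pp"
```

A harness like this cannot distinguish version-driven drift from normal sampling noise on its own; per the Wharton variability findings above, repeated runs per snapshot would be needed before attributing a delta to a model update.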