
R0023/2026-03-25/Q003

Query: What published evidence exists on prompt degradation over time — prompts that worked with one model version failing or producing different results after model updates?

BLUF: One landmark study provides strong evidence: Chen, Zaharia, and Zou (Stanford/Berkeley, 2023) documented GPT-4 accuracy dropping from 84% to 51% on prime number identification within 3 months between model versions. However, the published evidence base is narrow — essentially one rigorous study. That study itself shows mixed effects (some tasks degraded, others improved), and separate research on prompt variability suggests some perceived degradation may be normal stochastic noise. The phenomenon is real but the evidence base is thin and the reality is more complex than simple "prompts stop working."

Answer: H3 (Complex mixed effects — degradation is real but oversimplified) · Confidence: Medium


Summary

Entity · Description
Query Definition · Question as received, clarified, ambiguities, sub-questions
Assessment · Full analytical product
ACH Matrix · Evidence × hypotheses diagnosticity analysis
Self-Audit · ROBIS-adapted 4-domain process audit

Hypotheses

ID · Statement · Status
H1 · Strong evidence documents prompt degradation as significant · Partially supported
H2 · Published evidence is sparse/anecdotal · Partially supported
H3 · Evidence shows complex mixed effects, not simple degradation · Supported

Key Degradation Metrics

Study · Model · Task · Metric · Change · Timeframe
Chen et al. 2023 · GPT-4 · Prime number ID · Accuracy · 84% → 51% (-33pp) · 3 months
Chen et al. 2023 · GPT-4 · Code generation · Formatting errors · Increased · 3 months
Chen et al. 2023 · GPT-4 · Multi-hop questions · Accuracy · Improved · 3 months
Chen et al. 2023 · GPT-3.5 · Multi-hop questions · Accuracy · Declined · 3 months
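The cross-version comparison behind these numbers can be sketched as a simple regression harness: replay a fixed task set against two model snapshots and report the accuracy delta. The sketch below is hypothetical — `stub_model` stands in for real API calls, and its error rates are seeded to roughly mimic the reported 84%/51% figures, not to reproduce any actual model's behavior.

```python
import random

def is_prime(n: int) -> bool:
    """Ground truth for the prime-identification task."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def stub_model(version: str, n: int) -> str:
    # Stand-in for a real model call. Deterministically seeded so runs
    # are reproducible; accuracy rates are illustrative assumptions.
    rng = random.Random(f"{version}:{n}")
    correct = "yes" if is_prime(n) else "no"
    wrong = "no" if correct == "yes" else "yes"
    p_correct = 0.84 if version == "v1" else 0.51
    return correct if rng.random() < p_correct else wrong

def accuracy(version: str, cases: list[int]) -> float:
    """Fraction of cases where the model's answer matches ground truth."""
    hits = sum(stub_model(version, n) == ("yes" if is_prime(n) else "no")
               for n in cases)
    return hits / len(cases)

cases = list(range(100, 600))
a1, a2 = accuracy("v1", cases), accuracy("v2", cases)
print(f"v1: {a1:.2%}  v2: {a2:.2%}  delta: {(a2 - a1) * 100:+.1f}pp")
```

Because single-run deltas can reflect sampling noise rather than drift (the Wharton variability point above), a production harness would repeat each case several times per version before calling a change real.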

Searches

ID · Target · Type · Outcome
S01 · Published evidence on prompt degradation · WebSearch · 20 returned, 5 selected

Sources

Source · Description · Reliability · Relevance · Evidence
SRC01 · Chen et al. ChatGPT drift (Stanford/Berkeley) · High · High · 2 extracts
SRC02 · Wharton GAIL Report 1 (variability) · High · Medium · 1 extract
SRC03 · Deepchecks industry analysis · Medium-Low · Medium · 1 extract

Revisit Triggers

  • Publication of systematic cross-version comparison studies for Claude, Gemini, or other model families
  • Longitudinal prompt monitoring studies with continuous data collection
  • Meta-analysis aggregating cross-version performance data