R0023/2026-03-25/Q003 — Assessment¶
BLUF¶
One landmark study provides strong evidence for prompt degradation: Chen, Zaharia, and Zou (Stanford/Berkeley, 2023) documented GPT-4 accuracy dropping from 84% to 51% on prime number identification within 3 months. However, the published evidence base is narrow — essentially one rigorous study plus industry anecdotes. The Chen study itself shows mixed effects (degradation on some tasks, improvement on others), and the Wharton variability findings suggest that some perceived degradation may be normal stochastic variation. Prompt degradation is real but more complex than simple performance decline.
Probability¶
Rating: Likely (55-80%) that the mixed-effects hypothesis (H3) best characterizes the evidence
Confidence in assessment: Medium
Confidence rationale: One very strong primary source (Chen et al.) but limited replication. Industry consensus supports the phenomenon but without rigorous data. The Wharton variability finding introduces a genuine complicating factor.
Reasoning Chain¶
- Chen et al. (2023) compared the March and June 2023 versions of GPT-3.5 and GPT-4 on 7 task categories. GPT-4's prime number accuracy dropped 33 percentage points. [SRC01-E01, High reliability, High relevance]
- The same study found mixed effects: multi-hop knowledge questions improved in GPT-4 while code generation degraded. [SRC01-E02, High reliability, High relevance]
- Wharton GAIL Report 1 shows that identical prompts produce inconsistent results within a single model version (a perfect-accuracy rate of only 30.28% across 100 repetitions). [SRC02-E01, High reliability, Medium relevance]
- Deepchecks reports prompt updates as the primary source of production incidents but provides no specific data. [SRC03-E01, Medium-Low reliability, Medium relevance]
- JUDGMENT: Prompt degradation is a documented phenomenon with strong evidence from one study. The evidence shows mixed effects (not uniform degradation) and is complicated by baseline stochastic variation. More research is needed for confident generalization.
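The stochastic-variation caveat in the chain above can be made concrete with a standard two-proportion z-test: if an accuracy drop is far larger than sampling noise could produce, run-to-run variation alone cannot explain it. The sketch below is illustrative only; the per-version sample size of 500 questions is an assumption for the example, not a figure taken from Chen et al.

```python
import math

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z-statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)                    # pooled success rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))   # standard error
    return (p1 - p2) / se

# Chen et al.'s reported accuracies, with an assumed n of 500 per version.
z = two_proportion_z(0.84, 500, 0.51, 500)
# A |z| well above ~3 means the drop is very unlikely to be pure sampling
# noise (though this does not rule out other confounds such as prompt
# sensitivity or evaluation-harness changes).
print(round(z, 2))
```

Under these assumed sample sizes the statistic is roughly z ≈ 11, i.e., the 33-point drop is orders of magnitude beyond stochastic variation; the Wharton finding complicates small perceived drops, not one of this size.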
Evidence Base Summary¶
| Source | Description | Reliability | Relevance | Key Finding |
|---|---|---|---|---|
| SRC01 | Chen et al. ChatGPT drift study | High | High | 84% to 51% accuracy drop; mixed effects across tasks |
| SRC02 | Wharton GAIL Report 1 | High | Medium | Stochastic variation complicates degradation detection |
| SRC03 | Deepchecks industry analysis | Medium-Low | Medium | Industry claims without data |
Collection Synthesis¶
| Dimension | Assessment |
|---|---|
| Evidence quality | Medium — one strong study, one tangentially relevant study, one weak industry source |
| Source agreement | Medium — all agree degradation exists, but disagree on magnitude and universality |
| Source independence | High — Stanford/Berkeley, Wharton, and industry are independent |
| Outliers | None |
Detail¶
The evidence base for Q003 is notably thinner than for Q001. The field lacks systematic, ongoing monitoring studies that track prompt performance across multiple model versions. Chen et al. captured a two-point snapshot (March vs. June 2023), but no continuous monitoring framework currently produces published results.
Gaps¶
| Missing Evidence | Impact on Assessment |
|---|---|
| Systematic multi-version comparison studies beyond Chen et al. | Would strengthen or challenge the degradation finding |
| Longitudinal prompt performance tracking data | No continuous monitoring results published |
| Quantified prevalence of prompt degradation in production | Industry claims lack supporting data |
| Claude, Gemini, and other non-OpenAI model version comparisons | Chen et al. only tested GPT models |
Researcher Bias Check¶
Declared biases: No researcher profile provided.
Influence assessment: The query asks about "prompt degradation", which frames the phenomenon as negative. The research found that model updates produce mixed effects, which is a more neutral finding than the question implies.
Cross-References¶
| Entity | ID | File |
|---|---|---|
| Hypotheses | H1, H2, H3 | hypotheses/ |
| Sources | SRC01, SRC02, SRC03 | sources/ |
| ACH Matrix | — | ach-matrix.md |
| Self-Audit | — | self-audit.md |