R0023/2026-03-25/Q003/H1
Statement
Strong published evidence documents prompt degradation as a real and significant phenomenon, with multiple peer-reviewed studies demonstrating measurable performance drops when identical prompts are applied to updated models.
Status
Current: Partially supported
One landmark study (Chen et al., 2023) provides strong evidence, with a dramatic drop in GPT-4's accuracy on a prime-number task from 84% to 51%. However, the evidence base is narrower than H1 implies: it is essentially one study plus industry anecdotes, not "multiple peer-reviewed studies." The phenomenon is real, but the published evidence is concentrated rather than broadly replicated.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | GPT-4: accuracy on prime-number identification fell from 84% to 51% within 3 months |
| SRC03-E01 | Industry reports cite prompt updates as a primary source of production incidents |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E01 | Baseline stochastic variation may account for some reported degradation |
Reasoning
H1 is partially supported. Chen et al. provide the strongest evidence, but it is essentially a single study. The industry voice (Deepchecks) adds practitioner consensus but lacks empirical rigor. The Wharton variability finding introduces a complicating factor: some perceived degradation may be normal noise.
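The Wharton point can be made concrete by checking whether an observed cross-version accuracy drop exceeds ordinary sampling noise. The sketch below is a minimal illustration, not a method drawn from any cited source: the function name, the suite size of 500 prompts per snapshot, and the per-snapshot correct counts are assumptions, and the test treats each prompt as an independent Bernoulli trial, which ignores prompt-level correlation and decoding-temperature effects.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-sided two-proportion z-test.

    Asks whether the accuracy gap between two model snapshots, each
    evaluated on the same fixed prompt suite, is larger than pooled
    sampling noise alone would plausibly produce.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf; the two-sided p-value underflows to 0.0 for very large z.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical suite of 500 prompts per snapshot, with accuracies in the
# ballpark of the Chen et al. figures (84% vs. 51%).
z, p = two_proportion_z(correct_a=420, n_a=500, correct_b=255, n_b=500)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
```

On these made-up numbers the drop is far outside sampling noise (z ≈ 11), consistent with treating the Chen et al. result as genuine degradation; the variability concern matters most for small drops measured on small suites or single runs.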
Relationship to Other Hypotheses
H1 is the "strong form" claim. The evidence supports H3 (mixed effects, complex reality) more than H1 (strong, broad evidence base).