
R0023/2026-03-25/Q003/H1

Statement

Strong published evidence documents prompt degradation as a real and significant phenomenon, with multiple peer-reviewed studies demonstrating measurable performance drops when identical prompts are applied to updated models.

Status

Current: Partially supported

One landmark study (Chen et al., 2023) provides strong evidence with dramatic metrics (GPT-4's accuracy on prime-number identification fell from 84% to 51% over three months). However, the evidence base is narrower than H1 implies: this is essentially one study plus industry anecdotes, not "multiple peer-reviewed studies." The phenomenon is real, but the published evidence is concentrated rather than broadly replicated.

Supporting Evidence

Evidence Summary
SRC01-E01 GPT-4: 84% to 51% accuracy drop on prime numbers in 3 months
SRC03-E01 Industry reports prompt updates as primary source of production incidents

Contradicting Evidence

Evidence Summary
SRC02-E01 Baseline stochastic variation may account for some reported degradation

Reasoning

H1 is partially supported. Chen et al. (2023) provide the strongest evidence, but it is essentially a single study. The industry voice (Deepchecks, SRC03-E01) adds practitioner consensus but lacks empirical rigor. The Wharton variability finding (SRC02-E01) introduces a complicating factor: some perceived degradation may be ordinary run-to-run noise rather than genuine model change.
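The variability concern is testable in principle. As a minimal sketch (not from any of the cited sources), one could run the same prompt suite against two model snapshots and ask whether the observed accuracy gap exceeds baseline sampling noise, e.g. with a two-proportion z-test. The `simulate_run` harness below is a hypothetical stand-in for actually querying a model; the 84%/51% figures echo the Chen et al. result, and `n = 500` is an assumed suite size.

```python
import math
import random

def two_proportion_z(k1, n1, k2, n2):
    """z-statistic for the difference between two observed accuracies.

    k1/n1, k2/n2 are correct counts over run sizes for the two snapshots.
    |z| well above ~1.96 means the gap is unlikely to be sampling noise.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def simulate_run(true_acc, n, rng):
    """Hypothetical stand-in for a model evaluation run:
    each prompt independently succeeds with probability true_acc."""
    return sum(rng.random() < true_acc for _ in range(n))

rng = random.Random(0)
n = 500                                   # assumed prompts per run
before = simulate_run(0.84, n, rng)       # earlier snapshot (Chen et al. figure)
after = simulate_run(0.51, n, rng)        # later snapshot

z = two_proportion_z(before, n, after, n)
print(f"z = {z:.1f}")                     # a drop this large is far outside noise
```

With a real harness in place of `simulate_run`, a small |z| would support the Wharton-style reading (noise), while a large |z|, as here, would support genuine degradation.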

Relationship to Other Hypotheses

H1 is the "strong form" claim. The evidence better supports H3 (mixed effects, a complex reality) than H1's assertion of a strong, broad evidence base.