
R0023/2026-03-25/Q003/H1

Statement

Strong published evidence documents prompt degradation as a real and significant phenomenon, with multiple peer-reviewed studies demonstrating measurable performance drops when identical prompts are applied to updated models.

Status

Current: Partially supported

One landmark study (Chen et al., 2023) provides strong evidence with dramatic metrics (GPT-4's accuracy on prime-number identification fell from 84% to 51% over three months). However, the evidence base is narrower than H1 implies: this is essentially one study plus industry anecdotes, not "multiple peer-reviewed studies." The phenomenon is real, but the published evidence is concentrated rather than broadly replicated.

Supporting Evidence

Evidence Summary
SRC01-E01 GPT-4: 84% to 51% accuracy drop on prime numbers in 3 months
SRC03-E01 Industry reports prompt updates as primary source of production incidents

Contradicting Evidence

Evidence Summary
SRC02-E01 Baseline stochastic variation may account for some reported degradation

Reasoning

H1 is partially supported. Chen et al. (2023) provide the strongest evidence, but it is essentially a single study. The industry voice (Deepchecks, SRC03-E01) adds practitioner consensus but lacks empirical rigor. The Wharton variability finding (SRC02-E01) introduces a complicating factor: some perceived degradation may be ordinary run-to-run noise rather than genuine model change.
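The variability concern is testable in principle. As a minimal sketch (not from any of the cited sources), one could run the same prompt suite against two model snapshots and ask whether the observed accuracy gap exceeds baseline sampling noise, e.g. with a two-proportion z-test. The `simulate_run` harness below is a hypothetical stand-in for actually querying a model; the 84%/51% figures echo the Chen et al. result, and `n = 500` is an assumed suite size.

```python
import math
import random

def two_proportion_z(k1, n1, k2, n2):
    """z-statistic for the difference between two observed accuracies.

    k1/n1, k2/n2 are correct counts over run sizes for the two snapshots.
    |z| well above ~1.96 means the gap is unlikely to be sampling noise.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def simulate_run(true_acc, n, rng):
    """Hypothetical stand-in for a model evaluation run:
    each prompt independently succeeds with probability true_acc."""
    return sum(rng.random() < true_acc for _ in range(n))

rng = random.Random(0)
n = 500                                   # assumed prompts per run
before = simulate_run(0.84, n, rng)       # earlier snapshot (Chen et al. figure)
after = simulate_run(0.51, n, rng)        # later snapshot

z = two_proportion_z(before, n, after, n)
print(f"z = {z:.1f}")                     # a drop this large is far outside noise
```

With a real harness in place of `simulate_run`, a small |z| would support the Wharton-style reading (noise), while a large |z|, as here, would support genuine degradation.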

Relationship to Other Hypotheses

H1 is the "strong form" claim. The evidence better supports H3 (mixed effects, a complex reality) than H1's assertion of a strong, broad evidence base.