R0023/2026-03-25/Q003 — Assessment

BLUF

One landmark study provides strong evidence for prompt degradation: Chen, Zaharia, and Zou (Stanford/Berkeley, 2023) documented GPT-4 accuracy dropping from 84% to 51% on prime number identification within 3 months. However, the published evidence base is narrow — essentially one rigorous study plus industry anecdotes. The Chen study itself shows mixed effects (degradation on some tasks, improvement on others), and the Wharton variability findings suggest that some perceived degradation may be normal stochastic variation. Prompt degradation is real but more complex than simple performance decline.

Probability

Rating: Likely (55-80%) that the mixed-effects answer (H3) best characterizes the evidence

Confidence in assessment: Medium

Confidence rationale: One very strong primary source (Chen et al.) but limited replication. Industry consensus supports the phenomenon but without rigorous data. The Wharton variability finding introduces a genuine complicating factor.

Reasoning Chain

  1. Chen et al. (2023) compared the March and June 2023 versions of GPT-3.5 and GPT-4 on 7 task categories. GPT-4's prime number identification accuracy dropped 33 percentage points. [SRC01-E01, High reliability, High relevance]
  2. The same study found mixed effects: multi-hop knowledge questions improved in GPT-4 while code generation degraded. [SRC01-E02, High reliability, High relevance]
  3. Wharton GAIL Report 1 shows identical prompts produce inconsistent results within a single model version (30.28% perfect accuracy across 100 repetitions). [SRC02-E01, High reliability, Medium relevance]
  4. Deepchecks reports prompt updates as the primary source of production incidents but provides no specific data. [SRC03-E01, Medium-Low reliability, Medium relevance]
  5. JUDGMENT: Prompt degradation is a documented phenomenon with strong evidence from one study. The evidence shows mixed effects (not uniform degradation) and is complicated by baseline stochastic variation; a sketch of separating a genuine drop from that variation follows this list. More research is needed for confident generalization.
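
To make the stochastic-variation caveat concrete, here is a minimal sketch of one standard way to test whether an observed accuracy drop exceeds run-to-run noise: a two-proportion z-test applied to the headline Chen et al. numbers. The per-run item count (n = 500) is an assumption for illustration, not a figure from the study, and neither source uses this exact test.

```python
# Minimal sketch: is a reported accuracy drop larger than sampling noise?
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion z-statistic for the difference between two accuracies."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 84% (March) vs 51% (June); n = 500 items per run is an ASSUMPTION
# made only to keep the arithmetic concrete.
z = two_proportion_z(0.84, 0.51, 500, 500)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # underflows to 0.0 here
print(f"z = {z:.1f}, p = {p_value:.1e}")  # z is roughly 11: far beyond noise

# A 33-point drop dwarfs sampling noise at any plausible n. Drops of a
# few points, by contrast, can be indistinguishable from the within-version
# inconsistency the Wharton GAIL report documents.
```

The same logic cuts the other way: without a repeated-run baseline like Wharton's, small version-to-version differences should not be read as degradation.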

Evidence Base Summary

| Source | Description | Reliability | Relevance | Key Finding |
| --- | --- | --- | --- | --- |
| SRC01 | Chen et al. ChatGPT drift study | High | High | 84% to 51% accuracy drop; mixed effects across tasks |
| SRC02 | Wharton GAIL Report 1 | High | Medium | Stochastic variation complicates degradation detection |
| SRC03 | Deepchecks industry analysis | Medium-Low | Medium | Industry claims without data |

Collection Synthesis

| Dimension | Assessment |
| --- | --- |
| Evidence quality | Medium — one strong study, one tangentially relevant study, one weak industry source |
| Source agreement | Medium — all agree degradation exists, but disagree on magnitude and universality |
| Source independence | High — Stanford/Berkeley, Wharton, and industry are independent |
| Outliers | None |

Detail

The evidence base for Q003 is notably thinner than for Q001. The field lacks systematic, ongoing monitoring studies that track prompt performance across multiple model versions. Chen et al. captured a snapshot (March vs. June 2023) but there is no continuous monitoring framework producing published results.
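
No such framework exists in the cited sources. Purely as an illustration of what the gap describes, below is a minimal sketch of a longitudinal monitoring harness; `call_model`, `run_snapshot`, and the eval items are hypothetical placeholders, not drawn from any source.

```python
# Hypothetical sketch of a longitudinal prompt-monitoring harness -- the
# kind of continuous tracking absent from the published record.
from datetime import datetime, timezone
from typing import Callable

EVAL_SET = [  # frozen prompt/expected-answer pairs, identical across runs
    ("Is 7919 a prime number? Answer yes or no.", "yes"),
    ("Is 7918 a prime number? Answer yes or no.", "no"),
]

def run_snapshot(call_model: Callable[[str], str], model_version: str,
                 repetitions: int = 100) -> dict:
    """Score the frozen eval set against one model version.

    Repetitions matter: per the Wharton variability finding, a single
    run cannot separate drift from ordinary stochastic variation.
    """
    correct = total = 0
    for _ in range(repetitions):
        for prompt, expected in EVAL_SET:
            answer = call_model(prompt).strip().lower()
            correct += int(answer.startswith(expected))
            total += 1
    return {
        "model_version": model_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "accuracy": correct / total,
        "n": total,
    }

# Usage: wrap your provider's API in call_model, append each snapshot to
# a log, and compare accuracy across model versions and dates.
```

Run against each new model version (and periodically between versions), this would produce exactly the multi-version, repeated-measure data the gaps table below identifies as missing.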

Gaps

| Missing Evidence | Impact on Assessment |
| --- | --- |
| Systematic multi-version comparison studies beyond Chen et al. | Would strengthen or challenge the degradation finding |
| Longitudinal prompt performance tracking data | No continuous monitoring results published |
| Quantified prevalence of prompt degradation in production | Industry claims lack supporting data |
| Claude, Gemini, and other non-OpenAI model version comparisons | Chen et al. only tested GPT models |

Researcher Bias Check

Declared biases: No researcher profile provided.

Influence assessment: The query asks about "prompt degradation," which frames the phenomenon as negative. The research found that model updates produce mixed effects, a more neutral finding than the question's framing implies.

Cross-References

| Entity | ID | File |
| --- | --- | --- |
| Hypotheses | H1, H2, H3 | hypotheses/ |
| Sources | SRC01, SRC02, SRC03 | sources/ |
| ACH Matrix | | ach-matrix.md |
| Self-Audit | | self-audit.md |