# R0023/2026-03-25/Q003 — Query Definition

## Query as Received
What published evidence exists on prompt degradation over time — prompts that worked with one model version failing or producing different results after model updates?
## Query as Clarified
- Subject: The phenomenon of previously effective prompts producing degraded results after model updates
- Scope: Published empirical evidence documenting measurable performance changes when the same prompts are applied to different versions of the same model
- Evidence basis: Peer-reviewed papers, technical reports, reproducible benchmarks comparing model versions
- Temporal scope: 2023-2026, corresponding to the period of rapid LLM iteration
## Ambiguities Identified
- "Prompt degradation" conflates two distinct phenomena: (a) model-side changes causing different responses to identical prompts, and (b) prompt drift where human modifications gradually degrade prompt quality. Both are relevant.
- "Published evidence" could mean peer-reviewed papers, preprints, technical blog posts, or vendor documentation. The research prioritizes peer-reviewed and preprint sources but includes industry evidence where rigorous.
- The question assumes degradation is negative, but model updates could also improve prompt performance. The research should capture both directions.
## Sub-Questions
- Has the same prompt been shown to produce measurably different results across model versions in controlled studies?
- What is the magnitude of performance change documented (e.g., accuracy drops, format violations)?
- How quickly do these changes manifest (days, weeks, months)?
- Is prompt degradation a recognized phenomenon with a consistent name in the literature?
- What mechanisms cause prompt degradation (RLHF tuning, safety updates, architecture changes)?
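The second sub-question concerns measurable magnitude (accuracy drops, format violations). As a hedged sketch of how such a comparison might be operationalized — all prompts, responses, and numbers below are hypothetical placeholders, not data from any cited study — the same fixed prompts could be scored against two model versions and the metric deltas compared:

```python
# Hypothetical sketch: quantifying prompt degradation between two model
# versions. Responses are hard-coded stand-ins; in a real study they
# would be logged outputs from each model version on identical prompts.

def exact_match_accuracy(responses, expected):
    """Fraction of responses that exactly match the expected answer."""
    return sum(r == e for r, e in zip(responses, expected)) / len(expected)

def format_violation_rate(responses, required_prefix="Answer:"):
    """Fraction of responses missing the required output format."""
    return sum(not r.startswith(required_prefix) for r in responses) / len(responses)

expected     = ["Answer: 4", "Answer: Paris", "Answer: 42"]
v1_responses = ["Answer: 4", "Answer: Paris", "Answer: 42"]          # model v1
v2_responses = ["Answer: 4", "Paris is the capital.", "Answer: 41"]  # model v2

acc_delta = (exact_match_accuracy(v2_responses, expected)
             - exact_match_accuracy(v1_responses, expected))
fmt_delta = (format_violation_rate(v2_responses)
             - format_violation_rate(v1_responses))
print(f"accuracy change: {acc_delta:+.2f}")           # negative => degradation
print(f"format-violation change: {fmt_delta:+.2f}")   # positive => degradation
```

A study along these lines would report both deltas per task, which also bears on H3: a single update can degrade format adherence while leaving accuracy unchanged, or vice versa.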
## Hypotheses
| ID | Hypothesis | Description |
|---|---|---|
| H1 | Strong published evidence documents prompt degradation as a real and significant phenomenon | Multiple peer-reviewed studies demonstrate measurable performance drops when identical prompts are applied to updated models, with specific metrics and reproducible results |
| H2 | Published evidence is sparse or anecdotal — prompt degradation is mainly an industry complaint | While practitioners widely report prompt degradation, formal published evidence is limited to one or two studies and industry blog posts, lacking the rigor of systematic investigation |
| H3 | Evidence exists but shows the phenomenon is complex — degradation in some dimensions accompanies improvement in others | Published studies show that model updates produce mixed results: some tasks degrade while others improve, making "degradation" an oversimplification of a more nuanced reality |