# R0023/2026-03-25/Q003 — Query Definition

## Query as Received
What published evidence exists on prompt degradation over time — prompts that worked with one model version failing or producing different results after model updates?
## Query as Clarified
- Subject: The phenomenon of previously effective prompts producing degraded results after model updates
- Scope: Published empirical evidence documenting measurable performance changes when the same prompts are applied to different versions of the same model
- Evidence basis: Peer-reviewed papers, technical reports, reproducible benchmarks comparing model versions
- Temporal scope: 2023-2026, corresponding to the period of rapid LLM iteration
## Ambiguities Identified
- "Prompt degradation" conflates two distinct phenomena: (a) model-side changes causing different responses to identical prompts, and (b) prompt drift where human modifications gradually degrade prompt quality. Both are relevant.
- "Published evidence" could mean peer-reviewed papers, preprints, technical blog posts, or vendor documentation. The research prioritizes peer-reviewed and preprint sources but includes industry evidence where rigorous.
- The question assumes degradation is negative, but model updates could also improve prompt performance. The research should capture both directions.
## Sub-Questions
- Has the same prompt been shown to produce measurably different results across model versions in controlled studies?
- What is the magnitude of performance change documented (e.g., accuracy drops, format violations)?
- How quickly do these changes manifest (days, weeks, months)?
- Is prompt degradation a recognized phenomenon with a consistent name in the literature?
- What mechanisms cause prompt degradation (RLHF tuning, safety updates, architecture changes)?
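The second sub-question concerns measurable magnitude (accuracy drops, format violations). As a hedged sketch of how such a comparison might be operationalized — all prompts, responses, and numbers below are hypothetical placeholders, not data from any cited study — the same fixed prompts could be scored against two model versions and the metric deltas compared:

```python
# Hypothetical sketch: quantifying prompt degradation between two model
# versions. Responses are hard-coded stand-ins; in a real study they
# would be logged outputs from each model version on identical prompts.

def exact_match_accuracy(responses, expected):
    """Fraction of responses that exactly match the expected answer."""
    return sum(r == e for r, e in zip(responses, expected)) / len(expected)

def format_violation_rate(responses, required_prefix="Answer:"):
    """Fraction of responses missing the required output format."""
    return sum(not r.startswith(required_prefix) for r in responses) / len(responses)

expected     = ["Answer: 4", "Answer: Paris", "Answer: 42"]
v1_responses = ["Answer: 4", "Answer: Paris", "Answer: 42"]          # model v1
v2_responses = ["Answer: 4", "Paris is the capital.", "Answer: 41"]  # model v2

acc_delta = (exact_match_accuracy(v2_responses, expected)
             - exact_match_accuracy(v1_responses, expected))
fmt_delta = (format_violation_rate(v2_responses)
             - format_violation_rate(v1_responses))
print(f"accuracy change: {acc_delta:+.2f}")           # negative => degradation
print(f"format-violation change: {fmt_delta:+.2f}")   # positive => degradation
```

A study along these lines would report both deltas per task, which also bears on H3: a single update can degrade format adherence while leaving accuracy unchanged, or vice versa.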
## Hypotheses
| ID | Hypothesis | Description |
|---|---|---|
| H1 | Strong published evidence documents prompt degradation as a real and significant phenomenon | Multiple peer-reviewed studies demonstrate measurable performance drops when identical prompts are applied to updated models, with specific metrics and reproducible results |
| H2 | Published evidence is sparse or anecdotal — prompt degradation is mainly an industry complaint | While practitioners widely report prompt degradation, formal published evidence is limited to one or two studies and industry blog posts, lacking the rigor of systematic investigation |
| H3 | Evidence exists but shows the phenomenon is complex — degradation in some dimensions accompanies improvement in others | Published studies show that model updates produce mixed results: some tasks degrade while others improve, making "degradation" an oversimplification of a more nuanced reality |