# R0023/2026-03-25/Q003/H3

## Statement
Evidence exists but shows the phenomenon is complex — degradation in some dimensions accompanies improvement in others, making "degradation" an oversimplification of a more nuanced reality.
## Status
Current: Supported
Chen et al. explicitly found mixed effects: GPT-4 degraded on prime-number identification and code formatting but improved on multi-hop knowledge questions. The Wharton variability study adds another dimension: even within a single model version, identical prompts produce inconsistent results, so the baseline against which "degradation" is measured is inherently noisy. Together, these findings support H3's characterization that the reality is more complex than simple degradation.
## Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E02 | Mixed effects: some tasks degraded, others improved across model versions |
| SRC02-E01 | Baseline stochastic variation complicates degradation detection |
| SRC01-E01 | Degraded instruction-following identified as a common factor, but not a uniform one |
## Contradicting Evidence
No evidence directly contradicts H3. All findings are consistent with the mixed-effects characterization.
## Reasoning
H3 provides the most accurate description of the evidence. The Chen et al. study, the strongest evidence available, explicitly documents both degradation and improvement across different tasks. This is not a case where the evidence is ambiguous; the evidence itself shows mixed effects.
## Relationship to Other Hypotheses
H3 reconciles H1 and H2: degradation is real (supporting H1's core claim), the evidence base is narrow (supporting H2's observation about sparse research), and the reality is more complex than either H1 or H2 alone suggests.