R0023/2026-03-25/Q003/SRC01/E02
Performance changes were mixed across task types rather than uniformly negative.
URL: https://arxiv.org/abs/2307.09009
Extract
While GPT-4 degraded dramatically on prime number identification and code formatting, it improved on multi-hop knowledge-intensive questions; GPT-3.5 showed the opposite pattern on some of the same tasks. The authors concluded that "the behavior of the 'same' LLM service can change substantially in a relatively short amount of time," and identified degraded instruction-following capability as a common factor behind many of the observed behavior drifts.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Partially supports | Degradation is real, but not universal across all tasks |
| H2 | Contradicts | Systematic evidence, not anecdotal |
| H3 | Strongly supports | The mixed pattern (some tasks improve, others degrade) is the key finding |