R0023/2026-03-25/Q003/SRC01/E02

Research: R0023, Counterproductive advice and prompt lifecycle
Run: 2026-03-25
Query: Q003
Source: SRC01
Evidence: SRC01-E02
Type: Analytical

Performance changes were mixed across task types rather than uniformly negative.

URL: https://arxiv.org/abs/2307.09009

Extract

Between the March and June 2023 versions, GPT-4 degraded sharply on prime number identification and code formatting but improved on multi-hop knowledge-intensive questions. GPT-3.5 showed the opposite pattern on some of these tasks, for example improving on prime number identification while declining on multi-hop questions. The authors concluded: "the behavior of the 'same' LLM service can change substantially in a relatively short amount of time," and reported that GPT-4's degraded ability to follow instructions was a common factor behind many of the behavior drifts.

Relevance to Hypotheses

Hypothesis | Relationship      | Strength
H1         | Partially supports | Degradation is real, but not universal across all tasks
H2         | Contradicts        | Systematic evidence, not anecdotal
H3         | Strongly supports  | The mixed pattern (some tasks improve, others degrade) is the key finding