R0023/2026-03-25/Q003/SRC01/E02
Performance changes were mixed across task types rather than uniformly negative.
URL: https://arxiv.org/abs/2307.09009
Extract
While GPT-4 degraded dramatically on prime number identification and code formatting, it improved on multi-hop knowledge-intensive questions; GPT-3.5 showed the opposite pattern on some of the same tasks. The authors concluded that "the behavior of the 'same' LLM service can change substantially in a relatively short amount of time," and identified degraded instruction-following capability as a common factor behind many of the observed behavior drifts.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Partially supports | Degradation is real, but not universal across all tasks |
| H2 | Contradicts | Systematic evidence, not anecdotal |
| H3 | Strongly supports | The mixed pattern (some tasks improve, others degrade) is the key finding |