R0023/2026-03-25/Q003/SRC01/E01¶
GPT-4 prime number identification accuracy dropped from 84% to 51% between March and June 2023.
URL: https://arxiv.org/abs/2307.09009
Extract¶
Lingjiao Chen, Matei Zaharia, and James Zou (Stanford/Berkeley) compared the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 across 7 task categories. Key findings:
- Prime number identification: GPT-4 accuracy dropped from 84% (March) to 51% (June) — near random chance
- Code generation: Both models' June versions added more non-code text (e.g., markdown fences) around generated code, making the output less directly executable
- Sensitive questions: GPT-4 answered fewer sensitive questions in June, refusing more often
- Chain-of-thought responsiveness: GPT-4 showed reduced responsiveness to CoT by June
- Multi-hop questions: GPT-4 improved; GPT-3.5 declined (mixed effects)
These changes occurred within roughly three months, demonstrating that prompt degradation can happen rapidly across model versions.
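The study's core procedure — fix a question set, query each model version, compare accuracy — can be turned into a drift-monitoring harness. A minimal Python sketch follows; `query_model` is a hypothetical stand-in for a real API call, stubbed here to simulate the paper's reported March (~84%) and June (~51%) accuracies on prime identification:

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth primality check via trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def query_model(version: str, n: int) -> bool:
    """Hypothetical stand-in for asking a model 'Is n prime?'.
    Simulates the paper's reported accuracies rather than
    calling a real API."""
    accuracy = {"gpt-4-march": 0.84, "gpt-4-june": 0.51}[version]
    truth = is_prime(n)
    # Answer correctly with probability `accuracy`, else flip.
    return truth if random.random() < accuracy else not truth

def evaluate(version: str, numbers: list[int]) -> float:
    """Fraction of questions a model version answers correctly."""
    correct = sum(query_model(version, n) == is_prime(n) for n in numbers)
    return correct / len(numbers)

if __name__ == "__main__":
    random.seed(0)
    # A range containing both primes and composites, echoing the
    # study's balanced prime/composite setup.
    numbers = list(range(1000, 3000))
    for v in ("gpt-4-march", "gpt-4-june"):
        print(v, round(evaluate(v, numbers), 3))
```

With a real client substituted for `query_model`, re-running `evaluate` whenever a provider ships a new model version turns the study's methodology into an automated regression check for prompt drift.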
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | A 33-percentage-point accuracy drop is dramatic, measurable, and reproducible evidence of prompt degradation |
| H2 | Contradicts | This is a rigorous academic study, not anecdotal |
| H3 | Supports | Mixed results across tasks (some improved, some degraded) demonstrate complexity |
Context¶
This study was groundbreaking when published and generated significant media coverage. It provided early rigorous empirical evidence for what practitioners had been reporting anecdotally: the same prompt can produce different results across model versions.