
Research R0023 — Counterproductive advice and prompt lifecycle
Run 2026-03-25
Query Q003
Source SRC01
Evidence SRC01-E01
Type Statistical

GPT-4 prime number identification accuracy dropped from 84% to 51% between March and June 2023.

URL: https://arxiv.org/abs/2307.09009

Extract

Lingjiao Chen, Matei Zaharia, and James Zou (Stanford/Berkeley) compared the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 across 7 task categories. Key findings:

  • Prime number identification: GPT-4 accuracy dropped from 84% (March) to 51% (June), barely above the 50% random-chance baseline for this binary task
  • Code generation: Both models produced more formatting errors in June
  • Sensitive questions: GPT-4 became increasingly reluctant to respond
  • Chain-of-thought responsiveness: GPT-4 showed reduced responsiveness to CoT by June
  • Multi-hop questions: GPT-4 improved; GPT-3.5 declined (mixed effects)

These changes occurred within roughly three months, demonstrating that the performance of a fixed prompt can degrade rapidly as the underlying model drifts.
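The comparison underlying this finding can be sketched as a simple snapshot-to-snapshot regression check. The accuracy figures below come from the study; the function names and the 5-point alert threshold are illustrative assumptions, not anything the paper defines.

```python
# Minimal sketch of a prompt-regression check across model snapshots.
# Accuracies are from Chen, Zaharia & Zou (arXiv:2307.09009); the
# threshold and function names are illustrative assumptions.

def pp_drop(old_acc: float, new_acc: float) -> float:
    """Accuracy change in percentage points (positive = regression)."""
    return round((old_acc - new_acc) * 100, 1)

def is_regression(old_acc: float, new_acc: float, threshold_pp: float = 5.0) -> bool:
    """Flag a snapshot-to-snapshot drop larger than threshold_pp points."""
    return pp_drop(old_acc, new_acc) > threshold_pp

# GPT-4 prime-identification accuracy, March vs. June 2023 snapshots:
march, june = 0.84, 0.51
print(pp_drop(march, june))                # 33.0 percentage points
print(is_regression(march, june))          # True
print(round(june - 0.50, 2))               # 0.01 above the binary-chance baseline
```

A check like this only detects drift after the fact; the study's point is that such re-evaluation must be periodic, since the 33-point drop accumulated within a single quarter.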

Relevance to Hypotheses

Hypothesis  Relationship  Strength
H1          Supports      A 33-percentage-point accuracy drop is dramatic, measurable, reproducible evidence of prompt degradation
H2          Contradicts   A rigorous academic study, not anecdotal reporting
H3          Supports      Mixed results across tasks (some improved, some degraded) demonstrate complexity

Context

This study was groundbreaking when published and generated significant media coverage. It provided the first rigorous empirical evidence for what practitioners had been reporting anecdotally: the same prompts produce different results across model versions.