R0023/2026-03-25/Q003/SRC01/E01¶
GPT-4 prime number identification accuracy dropped from 84% to 51% between March and June 2023.
URL: https://arxiv.org/abs/2307.09009
Extract¶
Lingjiao Chen, Matei Zaharia, and James Zou (Stanford/Berkeley) compared the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 across 7 task categories. Key findings:
- Prime number identification: GPT-4 accuracy dropped from 84% (March) to 51% (June) — near random chance
- Code generation: Both models' June versions added more non-code text (e.g., markdown fences) around generated code, making the output less directly executable
- Sensitive questions: GPT-4 answered fewer sensitive questions in June, refusing more often
- Chain-of-thought responsiveness: GPT-4 showed reduced responsiveness to CoT by June
- Multi-hop questions: GPT-4 improved; GPT-3.5 declined (mixed effects)
These changes occurred within roughly three months, demonstrating that prompt degradation can happen rapidly across model versions.
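The study's core procedure — fix a question set, query each model version, compare accuracy — can be turned into a drift-monitoring harness. A minimal Python sketch follows; `query_model` is a hypothetical stand-in for a real API call, stubbed here to simulate the paper's reported March (~84%) and June (~51%) accuracies on prime identification:

```python
import random

def is_prime(n: int) -> bool:
    """Ground-truth primality check via trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def query_model(version: str, n: int) -> bool:
    """Hypothetical stand-in for asking a model 'Is n prime?'.
    Simulates the paper's reported accuracies rather than
    calling a real API."""
    accuracy = {"gpt-4-march": 0.84, "gpt-4-june": 0.51}[version]
    truth = is_prime(n)
    # Answer correctly with probability `accuracy`, else flip.
    return truth if random.random() < accuracy else not truth

def evaluate(version: str, numbers: list[int]) -> float:
    """Fraction of questions a model version answers correctly."""
    correct = sum(query_model(version, n) == is_prime(n) for n in numbers)
    return correct / len(numbers)

if __name__ == "__main__":
    random.seed(0)
    # A range containing both primes and composites, echoing the
    # study's balanced prime/composite setup.
    numbers = list(range(1000, 3000))
    for v in ("gpt-4-march", "gpt-4-june"):
        print(v, round(evaluate(v, numbers), 3))
```

With a real client substituted for `query_model`, re-running `evaluate` whenever a provider ships a new model version turns the study's methodology into an automated regression check for prompt drift.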
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | A 33-percentage-point accuracy drop is dramatic, measurable, and reproducible evidence of prompt degradation |
| H2 | Contradicts | This is a rigorous academic study, not anecdotal |
| H3 | Supports | Mixed results across tasks (some improved, some degraded) demonstrate complexity |
Context¶
This study was groundbreaking when published and generated significant media coverage. It provided early rigorous empirical evidence for what practitioners had been reporting anecdotally: the same prompt can produce different results across model versions.