C015


Research	R0028 — Prompt Engineering Claims
Run	2026-03-26
Claim	C015

Claim: A study from Stanford and Berkeley tracked GPT-4's behavior between March and June 2023 and documented accuracy dropping from 84% to 51% on certain tasks in three months.

BLUF: Confirmed. The paper 'How is ChatGPT's behavior changing over time?' by researchers from Stanford and UC Berkeley (Chen et al.) reports that GPT-4's accuracy at identifying prime versus composite numbers fell from 84% in its March 2023 version to 51% in its June 2023 version. An earlier version of the same paper reported an even steeper decline on prime number identification (97.6% to 2.4%).
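To make the size of the reported declines concrete, the following minimal sketch computes the absolute drop (in percentage points) and the relative drop for the two figure pairs quoted in the BLUF. The function name `accuracy_drop` and the dictionary labels are illustrative assumptions, not part of the study or of this assessment's tooling.

```python
def accuracy_drop(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    # Round to one decimal place to match the precision of the quoted figures.
    return round(absolute, 1), round(relative, 1)


# Figure pairs as quoted in the BLUF (March 2023 accuracy, June 2023 accuracy).
reported = {
    "84% to 51%": (84.0, 51.0),
    "97.6% to 2.4%": (97.6, 2.4),
}

for label, (before, after) in reported.items():
    points, relative = accuracy_drop(before, after)
    print(f"{label}: {points} percentage points, {relative}% relative decline")
```

The distinction matters when restating the claim: a fall from 84% to 51% is a 33-point drop, which is not the same as a "33% decline" in relative terms.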

Probability: Almost certain (95-99%) | Confidence: High

Summary

Entity	Description
Claim Definition	Claim text, scope, status
Assessment	Full analytical product with reasoning chain
ACH Matrix	Evidence x hypotheses diagnosticity analysis
Self-Audit	ROBIS-adapted 4-domain process audit

Hypotheses

ID	Hypothesis	Status
H1	Claim is accurate — 84% to 51% drop documented	Supported
H2	Partially correct — the specific task matters	Inconclusive
H3	Claim is materially wrong	Eliminated

Searches

ID	Target	Results	Selected
S01	Primary search	10	3

Sources

Source	Description	Reliability	Relevance
SRC01	Chen et al. — How is ChatGPT's behavior changing over time?	High	High