R0023/2026-03-25/Q003 — ACH Matrix

Matrix

| Evidence | H1: Strong evidence of degradation | H2: Sparse/anecdotal evidence | H3: Complex mixed effects |
|---|---|---|---|
| SRC01-E01: GPT-4 84% to 51% accuracy drop | ++ | -- | + |
| SRC01-E02: Mixed effects across task types | - | N/A | ++ |
| SRC02-E01: Stochastic variation in same model | - | + | ++ |
| SRC03-E01: Industry claims without data | + | + | N/A |

Legend:

- ++: Strongly supports
- +: Supports
- --: Strongly contradicts
- -: Contradicts
- N/A: Not applicable to this hypothesis

Diagnosticity Analysis

Most Diagnostic Evidence

| Evidence ID | Why Diagnostic |
|---|---|
| SRC01-E02 | Mixed effects across tasks discriminates between H1 (uniform degradation) and H3 (complex reality) |
| SRC02-E01 | Stochastic variation discriminates between H1 (all degradation is real) and H3 (some may be noise) |

Least Diagnostic Evidence

| Evidence ID | Why Non-Diagnostic |
|---|---|
| SRC03-E01 | Industry claims without data support both H1 and H2 depending on interpretation |
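
One way to make the diagnosticity judgments above checkable is to compare, for each piece of evidence, how far apart its ratings are for each pair of hypotheses. The following is a minimal sketch, not the method used to produce this analysis: the numeric mapping in `SCORE` and the helper names `pair_gap` and `pairwise_gaps` are assumptions introduced for illustration, and N/A ratings are treated as not comparable.

```python
from itertools import combinations

# Assumed numeric mapping for the legend symbols (not part of the original matrix).
SCORE = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": None}

HYPOTHESES = ["H1", "H2", "H3"]

# Ratings copied from the ACH matrix.
MATRIX = {
    "SRC01-E01": {"H1": "++", "H2": "--", "H3": "+"},
    "SRC01-E02": {"H1": "-", "H2": "N/A", "H3": "++"},
    "SRC02-E01": {"H1": "-", "H2": "+", "H3": "++"},
    "SRC03-E01": {"H1": "+", "H2": "+", "H3": "N/A"},
}


def pair_gap(evidence_id, hyp_a, hyp_b):
    """Absolute rating gap between two hypotheses, or None if either is N/A."""
    a = SCORE[MATRIX[evidence_id][hyp_a]]
    b = SCORE[MATRIX[evidence_id][hyp_b]]
    if a is None or b is None:
        return None
    return abs(a - b)


def pairwise_gaps(evidence_id):
    """Rating gap for every pair of hypotheses, keyed as 'H1 vs H2'."""
    return {
        f"{a} vs {b}": pair_gap(evidence_id, a, b)
        for a, b in combinations(HYPOTHESES, 2)
    }


for evidence_id in MATRIX:
    print(evidence_id, pairwise_gaps(evidence_id))
```

Under this assumed mapping, SRC01-E02 and SRC02-E01 each separate H1 from H3 by three steps, while SRC03-E01 shows a gap of zero between H1 and H2, which is consistent with the tables above. A different mapping, or a different treatment of N/A, could change the ordering.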

Outcome

Hypothesis supported: H3 — the evidence shows mixed effects that make "degradation" an oversimplification.

Hypotheses eliminated: None fully eliminated.

Hypotheses inconclusive: H1 (partially supported: the phenomenon is real, but the evidence base is narrow) and H2 (partially supported: the evidence is sparse beyond one study, but that one study is rigorous).
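
A common ACH convention is to rank hypotheses by how much evidence contradicts them, not by how much supports them. The sketch below tallies contradictions from the matrix under that convention. The weights in `WEIGHT` (1 for "-", 2 for "--", 0 otherwise) and the function name `inconsistency` are illustrative assumptions, not values taken from this analysis.

```python
# Assumed inconsistency weights: only contradicting ratings count against a hypothesis.
WEIGHT = {"++": 0, "+": 0, "-": 1, "--": 2, "N/A": 0}

# Ratings copied from the ACH matrix.
MATRIX = {
    "SRC01-E01": {"H1": "++", "H2": "--", "H3": "+"},
    "SRC01-E02": {"H1": "-", "H2": "N/A", "H3": "++"},
    "SRC02-E01": {"H1": "-", "H2": "+", "H3": "++"},
    "SRC03-E01": {"H1": "+", "H2": "+", "H3": "N/A"},
}


def inconsistency(hypothesis):
    """Sum of contradiction weights for one hypothesis across all evidence."""
    return sum(WEIGHT[ratings[hypothesis]] for ratings in MATRIX.values())


scores = {h: inconsistency(h) for h in ("H1", "H2", "H3")}

# Lowest inconsistency first; ties keep their H1, H2, H3 order.
ranked = sorted(scores.items(), key=lambda item: item[1])

for hypothesis, score in ranked:
    print(hypothesis, score)
```

With these assumed weights, H3 carries no contradicting evidence, while H1 and H2 each carry some. That matches the outcome above: H3 is favoured, and neither H1 nor H2 is contradicted heavily enough to be eliminated.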