R0023/2026-03-25/Q003: ACH Matrix
Matrix
| Evidence | H1: Strong evidence of degradation | H2: Sparse/anecdotal evidence | H3: Complex mixed effects |
|---|---|---|---|
| SRC01-E01: GPT-4 84% to 51% accuracy drop | ++ | -- | + |
| SRC01-E02: Mixed effects across task types | - | N/A | ++ |
| SRC02-E01: Stochastic variation in same model | - | + | ++ |
| SRC03-E01: Industry claims without data | + | + | N/A |
Legend:
- `++`: Strongly supports
- `+`: Supports
- `--`: Strongly contradicts
- `-`: Contradicts
- `N/A`: Not applicable to this hypothesis
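As a rough illustration of how the matrix can be read, the sketch below tallies how much evidence contradicts each hypothesis. ACH generally favors the hypothesis with the least contradicting evidence rather than the one with the most support. The numeric weights (`++` = 2, `+` = 1, `-` = -1, `--` = -2) and the function name `inconsistency_scores` are assumptions made for this example only; they are not part of the original analysis.

```python
# Numeric weights for the legend symbols (an illustrative assumption,
# not a scale defined by this analysis).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": None}

HYPOTHESES = ["H1", "H2", "H3"]

# Ratings copied from the matrix, one row per evidence item,
# in the column order H1, H2, H3.
MATRIX = {
    "SRC01-E01": ["++", "--", "+"],
    "SRC01-E02": ["-", "N/A", "++"],
    "SRC02-E01": ["-", "+", "++"],
    "SRC03-E01": ["+", "+", "N/A"],
}


def inconsistency_scores(matrix):
    """Sum the magnitude of contradicting ratings per hypothesis.

    Supporting and N/A cells are ignored; only "-" and "--" count.
    """
    totals = {h: 0 for h in HYPOTHESES}
    for ratings in matrix.values():
        for hypothesis, symbol in zip(HYPOTHESES, ratings):
            weight = WEIGHTS[symbol]
            if weight is not None and weight < 0:
                totals[hypothesis] += -weight
    return totals


scores = inconsistency_scores(MATRIX)
print(scores)  # {'H1': 2, 'H2': 2, 'H3': 0}
```

Under these assumed weights, H3 is the only hypothesis with no contradicting evidence, while H1 and H2 each carry some. A different weighting could change the magnitudes, so the numbers are only a reading aid.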
Diagnosticity Analysis
Most Diagnostic Evidence
| Evidence ID | Why Diagnostic |
|---|---|
| SRC01-E02 | Mixed effects across tasks discriminate between H1 (uniform degradation) and H3 (complex reality) |
| SRC02-E01 | Stochastic variation discriminates between H1 (all degradation is real) and H3 (some may be noise) |
Least Diagnostic Evidence
| Evidence ID | Why Non-Diagnostic |
|---|---|
| SRC03-E01 | Industry claims made without data are consistent with both H1 and H2, depending on interpretation, so they do not separate the two |
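One way to make the diagnosticity judgments above concrete is to measure how far apart two hypotheses are rated on the same evidence item. The sketch below does this for a chosen pair of hypotheses. The numeric weights and the helper `pairwise_gap` are hypothetical and introduced only for illustration; the original analysis states its diagnosticity reasoning qualitatively.

```python
from typing import Optional

# Numeric weights for the legend symbols (an illustrative assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": None}

# Column position of each hypothesis in a matrix row.
COLUMNS = {"H1": 0, "H2": 1, "H3": 2}

# Ratings copied from the ACH matrix, in the column order H1, H2, H3.
MATRIX = {
    "SRC01-E01": ["++", "--", "+"],
    "SRC01-E02": ["-", "N/A", "++"],
    "SRC02-E01": ["-", "+", "++"],
    "SRC03-E01": ["+", "+", "N/A"],
}


def pairwise_gap(evidence_id: str, first: str, second: str) -> Optional[int]:
    """Return the absolute rating gap between two hypotheses.

    A larger gap means the evidence separates the pair more sharply.
    Returns None when either cell is N/A, since no comparison is possible.
    """
    row = MATRIX[evidence_id]
    a = WEIGHTS[row[COLUMNS[first]]]
    b = WEIGHTS[row[COLUMNS[second]]]
    if a is None or b is None:
        return None
    return abs(a - b)


h1_vs_h3 = {e: pairwise_gap(e, "H1", "H3") for e in MATRIX}
h1_vs_h2 = {e: pairwise_gap(e, "H1", "H2") for e in MATRIX}

print(h1_vs_h3)
# {'SRC01-E01': 1, 'SRC01-E02': 3, 'SRC02-E01': 3, 'SRC03-E01': None}
print(h1_vs_h2)
# {'SRC01-E01': 4, 'SRC01-E02': None, 'SRC02-E01': 2, 'SRC03-E01': 0}
```

Under these assumed weights, SRC01-E02 and SRC02-E01 show the largest gap between H1 and H3, and SRC03-E01 shows a gap of zero between H1 and H2, which is consistent with the two tables above.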
Outcome
Hypothesis supported: H3. The evidence shows mixed effects that make "degradation" an oversimplification.
Hypotheses eliminated: None fully eliminated.
Hypotheses inconclusive: H1 (partially supported: the phenomenon is real, but the evidence base is narrow) and H2 (partially supported: the evidence is sparse beyond one study, but that one study is rigorous).