R0023/2026-03-25/Q001 — ACH Matrix
Matrix
| Evidence | H1: Multiple techniques counterproductive | H2: Techniques generally beneficial | H3: Effectiveness contingent |
|---|---|---|---|
| SRC01-E01: 58 techniques cataloged via PRISMA | N/A | + | + |
| SRC02-E01: CoT decreases accuracy in reasoning models | ++ | -- | ++ |
| SRC02-E02: CoT introduces errors on easy questions | ++ | -- | + |
| SRC03-E01: 9 negative effects from expert personas on MMLU-Pro | ++ | -- | ++ |
| SRC03-E02: Low-knowledge personas reduce accuracy | + | - | + |
| SRC03-E03: Domain-matched personas provide no benefit | + | -- | + |
| SRC04-E01: Expert persona 68.0% vs. base 71.6% (independent) | ++ | -- | ++ |
| SRC05-E01: 60-point per-question swings masked by aggregation | + | - | ++ |
Legend:
- `++`: Strongly supports
- `+`: Supports
- `--`: Strongly contradicts
- `-`: Contradicts
- `N/A`: Not applicable to this hypothesis
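
The ratings in the matrix can be tallied per hypothesis as a rough cross-check. The following is a minimal sketch, assuming a numeric weighting of `++` = 2, `+` = 1, `-` = -1, `--` = -2, and `N/A` = 0. That weighting and the names `WEIGHTS`, `MATRIX`, and `tally` are illustrative assumptions, not part of the original analysis.

```python
# Hypothetical numeric weights for the legend symbols (an assumption,
# not something the source analysis defines).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": 0}

# Ratings transcribed from the matrix, in the order (H1, H2, H3).
MATRIX = {
    "SRC01-E01": ("N/A", "+", "+"),
    "SRC02-E01": ("++", "--", "++"),
    "SRC02-E02": ("++", "--", "+"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "-", "+"),
    "SRC03-E03": ("+", "--", "+"),
    "SRC04-E01": ("++", "--", "++"),
    "SRC05-E01": ("+", "-", "++"),
}

HYPOTHESES = ("H1", "H2", "H3")


def tally(matrix):
    """Return (net score, count of contradicting items) per hypothesis."""
    net = {h: 0 for h in HYPOTHESES}
    contradictions = {h: 0 for h in HYPOTHESES}
    for ratings in matrix.values():
        for hypothesis, rating in zip(HYPOTHESES, ratings):
            weight = WEIGHTS[rating]
            net[hypothesis] += weight
            if weight < 0:
                contradictions[hypothesis] += 1
    return net, contradictions


net, contradictions = tally(MATRIX)
print(net)             # {'H1': 11, 'H2': -11, 'H3': 12}
print(contradictions)  # {'H1': 0, 'H2': 7, 'H3': 0}
```

Under this assumed weighting, H2 is contradicted by seven of the eight evidence items, while H1 and H3 are contradicted by none and differ by a single point. A tally of this kind cannot separate H1 from H3 on its own; that distinction rests on the qualitative reading of the evidence.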
Diagnosticity Analysis
Most Diagnostic Evidence
| Evidence ID | Why Diagnostic |
|---|---|
| SRC04-E01 | Independent replication of persona failures — discriminates strongly between H2 (edge cases) and H1/H3 (systematic effects) |
| SRC02-E01 | Model-type dependency of CoT — discriminates between H1 (universal harm) and H3 (contingent effects) |
| SRC05-E01 | Aggregation masking — explains why H2 appears plausible from casual testing while being empirically wrong |
Least Diagnostic Evidence
| Evidence ID | Why Non-Diagnostic |
|---|---|
| SRC01-E01 | Taxonomic survey: not applicable to H1 and equally supportive of H2 and H3, so it does not discriminate |
| SRC03-E02 | Low-knowledge persona failure is expected and unsurprising, so it does not help distinguish H1 from H3 |
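
One rough way to quantify diagnosticity is the spread between the highest and lowest rating an evidence item receives across the hypotheses. The sketch below assumes the same numeric weighting as an illustration (`++` = 2, `+` = 1, `-` = -1, `--` = -2) and skips `N/A` cells. The weighting and the names `ROWS` and `spread` are hypothetical, and the spread is only a proxy: it ignores qualitative factors such as whether a result is an independent replication.

```python
# Hypothetical numeric weights for the rating symbols (an assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2}

# Ratings transcribed from the matrix, in the order (H1, H2, H3).
ROWS = {
    "SRC01-E01": ("N/A", "+", "+"),
    "SRC02-E01": ("++", "--", "++"),
    "SRC02-E02": ("++", "--", "+"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "-", "+"),
    "SRC03-E03": ("+", "--", "+"),
    "SRC04-E01": ("++", "--", "++"),
    "SRC05-E01": ("+", "-", "++"),
}


def spread(ratings):
    """Difference between the highest and lowest applicable rating."""
    values = [WEIGHTS[r] for r in ratings if r != "N/A"]
    return max(values) - min(values)


# Sort evidence from least to most discriminating under this proxy.
# sorted() is stable, so ties keep their matrix order.
by_spread = sorted(ROWS, key=lambda evidence_id: spread(ROWS[evidence_id]))
least_diagnostic = by_spread[:2]

for evidence_id in by_spread:
    print(evidence_id, spread(ROWS[evidence_id]))
```

Under these assumptions the two lowest-spread items are SRC01-E01 (spread 0) and SRC03-E02 (spread 2), which matches the table above. The proxy does not fully reproduce the most-diagnostic table: SRC05-E01 has a spread of 3, lower than several items rated at 4, and its inclusion there reflects its explanatory value rather than rating spread alone.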
Outcome
Hypothesis supported: H3 — effectiveness is highly contingent on model, task, and context. The evidence consistently shows that the same technique produces different effects across models and conditions.
Hypotheses eliminated: H2 — the evidence is too consistent across independent studies to support the claim that counterproductive findings are mere edge cases.
Hypotheses inconclusive: H1 — partially supported. Multiple techniques are indeed counterproductive, but the counterproductive effects are context-dependent rather than universal, making H3 the more accurate characterization.