R0027/2026-03-26/Q001 — ACH Matrix

Matrix

Evidence | H1: Significant gap exists | H2: No meaningful gap | H3: Conditional gap
SRC01-E01: 36 papers studying multilingual prompting | + | -- | +
SRC01-E02: Native prompts win on some tasks | - | N/A | ++
SRC02-E01: Selective pre-translation outperforms | + | -- | ++
SRC02-E02: 200%+ improvement for low-resource | + | -- | ++
SRC03-E01: XLT achieves 10+ point improvement | + | -- | ++
SRC04-E01: Scaling does not close gap | ++ | -- | +
SRC05-E01: 30-point English-Swahili gap | ++ | -- | +
SRC06-E01: English beats Arabic on Arabic model | ++ | -- | +
SRC07-E01: Hindi 63.1%, Mandarin 64.6%, Arabic 67.4% vs English 70.9% | ++ | -- | +
SRC07-E02: English prompts 72.7% vs translated 67.2% | ++ | -- | +
SRC08-E01: 8-18pp per token/word; reasoning models narrow gap | + | -- | ++

Legend:

  • ++ Strongly supports
  • + Supports
  • - Contradicts
  • -- Strongly contradicts
  • N/A Not applicable to this hypothesis
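The matrix above can also be tallied mechanically. A minimal sketch in Python, assuming an illustrative numeric weighting for the rating symbols (the specific weights are an assumption, not part of the analysis itself):

```python
# Illustrative numeric weights for the rating symbols (an assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": 0}

# Ratings transcribed from the matrix: (H1, H2, H3) per evidence item.
MATRIX = {
    "SRC01-E01": ("+", "--", "+"),
    "SRC01-E02": ("-", "N/A", "++"),
    "SRC02-E01": ("+", "--", "++"),
    "SRC02-E02": ("+", "--", "++"),
    "SRC03-E01": ("+", "--", "++"),
    "SRC04-E01": ("++", "--", "+"),
    "SRC05-E01": ("++", "--", "+"),
    "SRC06-E01": ("++", "--", "+"),
    "SRC07-E01": ("++", "--", "+"),
    "SRC07-E02": ("++", "--", "+"),
    "SRC08-E01": ("+", "--", "++"),
}

def tally(matrix):
    """Sum the weighted ratings for each hypothesis column."""
    totals = [0, 0, 0]
    for ratings in matrix.values():
        for i, rating in enumerate(ratings):
            totals[i] += WEIGHTS[rating]
    return dict(zip(("H1", "H2", "H3"), totals))

print(tally(MATRIX))
```

Under these weights H3 edges out H1 while H2 is strongly negative, which is consistent with the outcome recorded below; the tally is a sanity check, not a substitute for the qualitative ACH judgment.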

Diagnosticity Analysis

Most Diagnostic Evidence

Evidence ID | Why Diagnostic
SRC01-E02 | Contradicts H1 (simple gap) while supporting H3 (conditional), so it discriminates between H1 and H3
SRC08-E01 | Supports H1 and H3 in different ways: the gap exists (H1), but reasoning models narrow it (H3)
SRC02-E01 | Shows selective pre-translation producing task-dependent optimal strategies, which favors H3 over H1

Least Diagnostic Evidence

Evidence ID | Why Non-Diagnostic
SRC01-E01 | Supports H1 and H3 equally; confirms the research area exists but does not discriminate between them
SRC04-E01 | Supports H1 and weakly supports H3; establishes the gap's magnitude but says nothing about its conditionality
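Because every item contradicts H2, the useful question is whether an item separates H1 from H3. A rough proxy, reusing the illustrative weights above (the weights and this heuristic are assumptions, not the qualitative judgment used in this analysis), is the absolute difference between an item's H1 and H3 weights:

```python
# Illustrative numeric weights for the rating symbols (an assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": 0}

def h1_h3_separation(h1_rating, h3_rating):
    """Absolute weight difference between H1 and H3: 0 means non-diagnostic."""
    return abs(WEIGHTS[h1_rating] - WEIGHTS[h3_rating])

print(h1_h3_separation("-", "++"))  # SRC01-E02: 3 (highly diagnostic)
print(h1_h3_separation("+", "+"))   # SRC01-E01: 0 (non-diagnostic)
```

This matches the qualitative ranking: SRC01-E02 scores highest because it points in opposite directions for H1 and H3, while SRC01-E01 scores zero because it rates both identically.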

Outcome

Hypothesis supported: H3 — The performance gap is real but conditional on language resource level, task type, model architecture, and prompting strategy

Hypotheses eliminated: H2 — No evidence supports equivalent performance across languages; every source contradicts it

Hypotheses inconclusive: H1 — Partially supported but oversimplifies; the gap is not a simple, uniform degradation