R0027/2026-03-26/Q001 — ACH Matrix

Matrix

Evidence | H1: Significant gap exists | H2: No meaningful gap | H3: Conditional gap
SRC01-E01: 36 papers studying multilingual prompting | + | -- | +
SRC01-E02: Native prompts win on some tasks | - | N/A | ++
SRC02-E01: Selective pre-translation outperforms | + | -- | ++
SRC02-E02: 200%+ improvement for low-resource | + | -- | ++
SRC03-E01: XLT achieves 10+ point improvement | + | -- | ++
SRC04-E01: Scaling does not close gap | ++ | -- | +
SRC05-E01: 30-point English-Swahili gap | ++ | -- | +
SRC06-E01: English beats Arabic on Arabic model | ++ | -- | +
SRC07-E01: Hindi 63.1%, Mandarin 64.6%, Arabic 67.4% vs English 70.9% | ++ | -- | +
SRC07-E02: English prompts 72.7% vs translated 67.2% | ++ | -- | +
SRC08-E01: 8-18pp per token/word; reasoning models narrow gap | + | -- | ++

Legend:

  • ++ Strongly supports
  • + Supports
  • - Contradicts
  • -- Strongly contradicts
  • N/A Not applicable to this hypothesis
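The matrix above can also be tallied mechanically. A minimal sketch in Python, assuming an illustrative numeric weighting for the rating symbols (the specific weights are an assumption, not part of the analysis itself):

```python
# Illustrative numeric weights for the rating symbols (an assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": 0}

# Ratings transcribed from the matrix: (H1, H2, H3) per evidence item.
MATRIX = {
    "SRC01-E01": ("+", "--", "+"),
    "SRC01-E02": ("-", "N/A", "++"),
    "SRC02-E01": ("+", "--", "++"),
    "SRC02-E02": ("+", "--", "++"),
    "SRC03-E01": ("+", "--", "++"),
    "SRC04-E01": ("++", "--", "+"),
    "SRC05-E01": ("++", "--", "+"),
    "SRC06-E01": ("++", "--", "+"),
    "SRC07-E01": ("++", "--", "+"),
    "SRC07-E02": ("++", "--", "+"),
    "SRC08-E01": ("+", "--", "++"),
}

def tally(matrix):
    """Sum the weighted ratings for each hypothesis column."""
    totals = [0, 0, 0]
    for ratings in matrix.values():
        for i, rating in enumerate(ratings):
            totals[i] += WEIGHTS[rating]
    return dict(zip(("H1", "H2", "H3"), totals))

print(tally(MATRIX))
```

Under these weights H3 edges out H1 while H2 is strongly negative, which is consistent with the outcome recorded below; the tally is a sanity check, not a substitute for the qualitative ACH judgment.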

Diagnosticity Analysis

Most Diagnostic Evidence

Evidence ID | Why Diagnostic
SRC01-E02 | Contradicts H1 (simple gap) while supporting H3 (conditional), so it discriminates between H1 and H3
SRC08-E01 | Supports H1 and H3 in different ways: the gap exists (H1), but reasoning models narrow it (H3)
SRC02-E01 | Shows selective pre-translation producing task-dependent optimal strategies, which favors H3 over H1

Least Diagnostic Evidence

Evidence ID | Why Non-Diagnostic
SRC01-E01 | Supports H1 and H3 equally; confirms the research area exists but does not discriminate between them
SRC04-E01 | Supports H1 and weakly supports H3; establishes the gap's magnitude but says nothing about its conditionality
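Because every item contradicts H2, the useful question is whether an item separates H1 from H3. A rough proxy, reusing the illustrative weights above (the weights and this heuristic are assumptions, not the qualitative judgment used in this analysis), is the absolute difference between an item's H1 and H3 weights:

```python
# Illustrative numeric weights for the rating symbols (an assumption).
WEIGHTS = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": 0}

def h1_h3_separation(h1_rating, h3_rating):
    """Absolute weight difference between H1 and H3: 0 means non-diagnostic."""
    return abs(WEIGHTS[h1_rating] - WEIGHTS[h3_rating])

print(h1_h3_separation("-", "++"))  # SRC01-E02: 3 (highly diagnostic)
print(h1_h3_separation("+", "+"))   # SRC01-E01: 0 (non-diagnostic)
```

This matches the qualitative ranking: SRC01-E02 scores highest because it points in opposite directions for H1 and H3, while SRC01-E01 scores zero because it rates both identically.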

Outcome

Hypothesis supported: H3 — The performance gap is real but conditional on language resource level, task type, model architecture, and prompting strategy

Hypotheses eliminated: H2 — No evidence supports equivalent performance across languages; every source contradicts it

Hypotheses inconclusive: H1 — Partially supported but oversimplifies; the gap is not a simple, uniform degradation