E01


Research	R0027 — Multilingual prompt engineering challenges
Run	2026-03-26
Query	Q001
Source	SRC05
Evidence	SRC05-E01
Type	Statistical

30-point English-Swahili performance gap with clear language family hierarchy

URL: https://arxiv.org/html/2503.10497v1

Extract

The best model (Qwen2.5-72B) achieves 70.3% on English but only 40.1% on Swahili — a 30.2-point gap. Performance follows a clear hierarchy: "English > European languages > East Asian languages > South Asian/low-resource languages." Even the best models show 20-30 point drops on low-resource languages. "Reasoning-enhanced training yields inconsistent benefits across languages."
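
The gap arithmetic above can be checked with a minimal sketch. The helper name `score_gap` is hypothetical and not taken from the source; the two scores are the ones quoted in the extract. The relative drop is an extra derived figure, not one the source reports.

```python
def score_gap(high: float, low: float) -> tuple[float, float]:
    """Return (absolute gap in points, relative drop in percent), rounded to 1 decimal."""
    absolute = high - low
    relative = absolute / high * 100
    return round(absolute, 1), round(relative, 1)


# Scores quoted in the extract for Qwen2.5-72B (percent accuracy).
english, swahili = 70.3, 40.1

absolute, relative = score_gap(english, swahili)
print(f"absolute gap: {absolute} points, relative drop: {relative}%")
```

Under these inputs the absolute gap matches the 30.2 points stated in the extract, and the relative drop works out to roughly 43% of the English score.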

Relevance to Hypotheses

Hypothesis	Relationship	Strength
H1	Supports	Provides the most precise quantification of the cross-language gap — up to 30 points
H2	Contradicts	Unambiguous, large-magnitude differences across all tested languages
H3	Supports	The gap varies by language family (European closer to English, South Asian furthest)

Context

The performance hierarchy (English > European > East Asian > South Asian/low-resource) is consistent across multiple benchmarks, strengthening the finding's reliability.