R0027/2026-03-26/Q001/SRC05/E01
30-point English-Swahili performance gap with clear language family hierarchy
URL: https://arxiv.org/html/2503.10497v1
Extract
The best model (Qwen2.5-72B) achieves 70.3% on English but only 40.1% on Swahili, a 30.2-point gap. Performance follows a clear hierarchy: "English > European languages > East Asian languages > South Asian/low-resource languages." Even the best models drop 20-30 points on low-resource languages. The source also notes that "Reasoning-enhanced training yields inconsistent benefits across languages."
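The gap arithmetic in this extract can be checked with a minimal sketch. The two accuracy figures come from the extract; the variable names and the relative-drop calculation are illustrative additions, not something the source reports.

```python
# Accuracy figures quoted in the extract for the best model (Qwen2.5-72B).
english_acc = 70.3  # percent, English
swahili_acc = 40.1  # percent, Swahili

# Absolute gap in percentage points, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
gap_points = round(english_acc - swahili_acc, 1)

# Relative drop: the gap as a share of the English score.
# This derived figure is an assumption-free calculation but is
# not stated in the source.
relative_drop_pct = round((english_acc - swahili_acc) / english_acc * 100, 1)

print(f"Absolute gap: {gap_points} points")
print(f"Relative drop: {relative_drop_pct}% of the English score")
```

The absolute gap matches the 30.2 points stated in the extract. The relative drop is included only to show how large the gap is against the English baseline.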
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Provides the most precise quantification of the cross-language gap: up to 30 points |
| H2 | Contradicts | Unambiguous, large-magnitude differences across all tested languages |
| H3 | Supports | The gap varies by language family (European closer to English, South Asian furthest) |
Context
The performance hierarchy (English > European > East Asian > South Asian/low-resource) is consistent across multiple benchmarks, strengthening the finding's reliability.