Research R0027 — Multilingual prompt engineering challenges
Run 2026-03-26
Query Q001
Source SRC04
Evidence SRC04-E01
Type Statistical

High-resource languages consistently outperform low-resource ones; model scaling does not close the gap

URL: https://arxiv.org/html/2502.07346v1

Extract

"High-resource languages such as French and Chinese consistently outperform low-resource languages like Telugu, Swahili, and Bengali." DeepSeek-V3 showed 50%+ accuracy in science reasoning for English/French but dropped below 40% for Telugu. Critically, "the proportion of larger models achieving smaller GAPs only slightly exceeds 0.5 for most model families" — meaning model size increases do not reliably reduce the cross-language performance gap.

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | 10+ percentage point gaps confirmed across 17 languages
H2 | Contradicts | Clear, consistent performance hierarchy documented
H3 | Supports | Gap magnitude varies by language resource level and task type

Context

The finding that model scaling does not close the gap is significant: it suggests the problem is structural, rooted in training data coverage and tokenization, rather than simply a matter of model capacity.
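
As a rough illustration of the tokenization point (not from the source), the sketch below counts tokens for "thank you" across several of the languages the extract names, using the open-source tiktoken library; subword vocabularies trained mostly on high-resource text typically spend many more tokens per word on low-resource scripts.

```python
# Minimal sketch of why tokenization is one structural suspect: byte-pair
# tokenizers trained mostly on high-resource text split low-resource
# scripts into far more tokens for the same meaning.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used byte-level BPE vocabulary

# "Thank you" in each language (common equivalents, for illustration only)
samples = {
    "English": "thank you",
    "French": "merci",
    "Chinese": "谢谢",
    "Telugu": "ధన్యవాదాలు",
    "Swahili": "asante",
    "Bengali": "ধন্যবাদ",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    # More tokens for the same meaning means fewer effective words per
    # context window and a weaker learned representation per token.
    print(f"{lang:8s} {text!r:22s} -> {len(tokens)} tokens")
```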