R0027/2026-03-26/Q001/SRC08/E01
Tokenization fertility predicts accuracy: each additional token per word costs 8-18 percentage points
URL: https://arxiv.org/html/2509.05486v1
Extract
"Higher fertility (tokens per word) consistently predicts lower accuracy. Linear regressions show slopes ranging from -0.08 to -0.18, meaning each additional token per word reduces accuracy by 8-18 percentage points." Economically, "doubling tokens results in 4x increases in training cost and time due to quadratic O(n^2) attention scaling." Training Llama-3.1-405B costs $105M in English but $420M in a 2x fertility language. Reasoning models (DeepSeek, o1) narrow the gap by 8-12 points.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Provides a causal mechanism for why performance varies across languages |
| H2 | Contradicts | Demonstrates a structural, quantifiable cause of language-dependent performance |
| H3 | Supports | Reasoning models partially mitigate the gap, confirming conditionality |
Context
This paper offers the most compelling structural explanation for why models perform worse in non-English languages: tokenization inefficiency imposes a compounding tax on both accuracy and cost. Because the relationship is linear and predictable, it is a testable mechanism rather than a vague correlation.
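Stated as a falsifiable prediction (the notation below is mine, not the paper's): the accuracy gap between any two languages should scale with their fertility gap, and training cost with the square of fertility.

```latex
% Notation mine; slope range and quadratic-cost claim from the extract above.
\[
\mathrm{Acc}_a - \mathrm{Acc}_b \;\approx\; \beta\,(f_a - f_b),
\qquad \beta \in [-0.18,\,-0.08],
\qquad \mathrm{TrainCost} \;\propto\; f^{2}
\]
% f_x = tokenizer fertility (tokens per word) of language x.
```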