R0027/2026-03-26/Q001/SRC08/E01
Tokenization fertility predicts accuracy: each additional token per word costs 8-18 percentage points
URL: https://arxiv.org/html/2509.05486v1
Extract
"Higher fertility (tokens per word) consistently predicts lower accuracy. Linear regressions show slopes ranging from -0.08 to -0.18, meaning each additional token per word reduces accuracy by 8-18 percentage points." Economically, "doubling tokens results in 4x increases in training cost and time due to quadratic O(n^2) attention scaling." Training Llama-3.1-405B costs $105M in English but $420M in a 2x fertility language. Reasoning models (DeepSeek, o1) narrow the gap by 8-12 points.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Provides a causal mechanism for why performance varies across languages |
| H2 | Contradicts | Demonstrates a structural, quantifiable cause of language-dependent performance |
| H3 | Supports | Reasoning models partially mitigate the gap, confirming conditionality |
Context
This paper offers the most compelling structural explanation for why models perform worse in non-English languages: tokenization inefficiency imposes a compounding tax on both accuracy and cost. Because the relationship is linear and predictable, it is a testable mechanism rather than a vague correlation.
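Stated as a falsifiable prediction (the notation below is mine, not the paper's): the accuracy gap between any two languages should scale with their fertility gap, and training cost with the square of fertility.

```latex
% Notation mine; slope range and quadratic-cost claim from the extract above.
\[
\mathrm{Acc}_a - \mathrm{Acc}_b \;\approx\; \beta\,(f_a - f_b),
\qquad \beta \in [-0.18,\,-0.08],
\qquad \mathrm{TrainCost} \;\propto\; f^{2}
\]
% f_x = tokenizer fertility (tokens per word) of language x.
```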