R0027/2026-03-26/Q001/SRC08/E01

Research R0027 — Multilingual prompt engineering challenges
Run 2026-03-26
Query Q001
Source SRC08
Evidence SRC08-E01
Type Statistical

Tokenization fertility predicts accuracy: each additional token per word costs 8-18 percentage points

URL: https://arxiv.org/html/2509.05486v1

Extract

"Higher fertility (tokens per word) consistently predicts lower accuracy. Linear regressions show slopes ranging from -0.08 to -0.18, meaning each additional token per word reduces accuracy by 8-18 percentage points." Economically, "doubling tokens results in 4x increases in training cost and time due to quadratic O(n^2) attention scaling." Training Llama-3.1-405B costs $105M in English but $420M in a 2x fertility language. Reasoning models (DeepSeek, o1) narrow the gap by 8-12 points.
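The two quantitative claims in the extract can be sanity-checked with a short sketch. This is illustrative code, not from the paper: the slope values and the quadratic cost model are taken from the quoted figures, and the function names are my own.

```python
# Illustrative sketch (not from the paper): applying the reported
# fertility-accuracy slope and the quadratic attention-cost scaling.

def predicted_accuracy_drop(extra_tokens_per_word, slope=-0.08):
    """Accuracy change (in fraction of a point per token/word),
    using the paper's reported slope range of -0.08 to -0.18."""
    return slope * extra_tokens_per_word

def cost_multiplier(fertility_ratio):
    """Relative training cost under O(n^2) attention: token count
    scales linearly with fertility, so cost scales with its square."""
    return fertility_ratio ** 2

# One extra token per word, at the shallow end of the slope range:
print(predicted_accuracy_drop(1.0))   # -0.08 -> an 8-point accuracy drop

# A language with 2x English fertility:
print(cost_multiplier(2.0))           # 4.0 -> consistent with $105M vs $420M
```

The check confirms the extract's internal consistency: a 2x fertility ratio yields a 4x cost multiplier, matching the $105M-to-$420M comparison.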

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | Provides a causal mechanism for why performance varies across languages
H2 | Contradicts | Demonstrates a structural, quantifiable cause of language-dependent performance
H3 | Supports | Reasoning models partially mitigate the gap, confirming conditionality

Context

This paper provides the most compelling structural explanation for why non-English languages perform worse: tokenization inefficiency imposes a compounding tax on both accuracy and cost. Because the relationship is linear and predictable, it functions as a testable mechanism rather than a vague correlation.