R0027/2026-03-26/Q002/SRC03/E01
Morphological complexity creates a compounding tokenization tax
URL: https://arxiv.org/html/2509.05486v1
Extract
"Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy." For agglutinative and highly inflected languages, fertility (tokens per word) is higher, and "each additional token per word reduces accuracy by 8-18 percentage points." The relationship is compounding: "doubling tokens results in 4x increases in training cost and time due to quadratic O(n^2) attention scaling."
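The arithmetic behind the quoted "4x" figure can be sketched as follows. This is a minimal illustration, not the source's method: the function names `fertility` and `relative_attention_cost` are hypothetical, and the sketch assumes that only the self-attention term is being compared, that it scales with the square of sequence length, and that sequence length is proportional to fertility for a fixed number of words.

```python
def fertility(token_count: int, word_count: int) -> float:
    """Tokens per word for a piece of text (hypothetical helper)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def relative_attention_cost(baseline_fertility: float, target_fertility: float) -> float:
    """Ratio of attention cost for the same text at two fertility levels.

    Assumption: cost is proportional to n^2, where n is the token count,
    and n is proportional to fertility for a fixed word count.
    """
    if baseline_fertility <= 0 or target_fertility <= 0:
        raise ValueError("fertility values must be positive")
    return (target_fertility / baseline_fertility) ** 2


# Illustrative numbers only: the same 10-word sentence tokenized into
# 10 tokens in one language and 20 tokens in another.
baseline = fertility(10, 10)   # 1.0 token per word
target = fertility(20, 10)     # 2.0 tokens per word
ratio = relative_attention_cost(baseline, target)  # (2.0 / 1.0) ** 2
```

Under these assumptions, doubling fertility quadruples the attention term, which matches the quoted scaling. Components of training cost that grow linearly with token count are not modeled here.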
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Morphological complexity creates measurable challenges through tokenization |
| H2 | Contradicts | Clear causal pathway from linguistic structure to performance |
| H3 | Supports | The challenge is mediated through tokenization, not linguistic structure directly |
Context
This finding is critical for Q002: it shows that structural features such as agglutination and inflection do create challenges, but that the mechanism is tokenization rather than an in-principle inability of models to handle these structures.