R0027/2026-03-26/Q002/H3

Statement

Challenges from linguistic structural differences exist but are primarily mediated through tokenization efficiency and training data representation rather than linguistic structure per se. The causal chain is: linguistic structure → tokenization/training data → performance impact.

Status

Current: Supported

H3 is the best-supported hypothesis. The evidence establishes a clear causal chain: languages with complex morphology, non-Latin scripts, or agglutinative structures require more tokens per word, which directly reduces accuracy (an 8-18 percentage-point drop per additional token per word) and increases cost (self-attention compute scales quadratically with sequence length). Meanwhile, direct linguistic nuances account for only ~2% of failures. The challenge is real and traceable to linguistic structure, but the mechanism is computational.

Supporting Evidence

Evidence Summary
SRC03-E01 Tokenization fertility predicts accuracy; morphological complexity → more tokens → lower accuracy
SRC04-E01 72-87% of failures from model limitations (tokenizer, latent space); ~2% from language nuances
SRC05-E01 Agglutinative languages break tokenization; workaround is translation to English
SRC02-E01 Arabic complexity affects even Arabic-centric models — the challenge is in processing, not understanding
SRC01-E02 Mandarin's tonal challenge is largely irrelevant for text — the textual challenge is about characters/tokenization

Contradicting Evidence

No evidence directly contradicts H3. All evidence is consistent with the mediated-mechanism model.

Reasoning

The evidence converges on a clear picture: linguistic structural features (SOV, agglutination, morphological richness, non-Latin scripts) are real and create real challenges, but the mechanism is not that models "cannot understand" these structures. Rather, the mechanism is that current tokenization systems impose a disproportionate computational cost on these languages, which reduces both accuracy and economic viability. This distinction matters for solutions — the fix is better tokenizers and more representative training data, not fundamentally different model architectures.
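The cost side of this mechanism can be sketched in a few lines. The fertility values below are illustrative assumptions, not measured figures from the cited sources; the point is only the shape of the relationship: higher tokens-per-word (fertility) means longer sequences, and self-attention cost grows with the square of sequence length, so a language tokenized at 3x the fertility of English pays roughly 9x the attention cost for the same text.

```python
# Toy sketch: how tokenization fertility (tokens per word) drives
# quadratic attention cost. All fertility numbers are hypothetical.

def attention_cost(words: int, fertility: float) -> float:
    """Relative self-attention cost for a passage of `words` words:
    proportional to the square of the token sequence length."""
    seq_len = words * fertility
    return seq_len ** 2

# Hypothetical fertility values for a 100-word passage.
fertilities = {"English": 1.3, "Turkish": 2.6, "Amharic": 3.9}

baseline = attention_cost(100, fertilities["English"])
for lang, f in fertilities.items():
    rel = attention_cost(100, f) / baseline
    print(f"{lang}: fertility={f}, relative attention cost={rel:.1f}x")
```

Under these assumed numbers, doubling fertility quadruples attention cost and tripling it gives a ninefold increase, which is the "quadratic scaling" penalty referenced above: the economic disadvantage follows mechanically from sequence length, not from any property of the language itself.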

Relationship to Other Hypotheses

H3 integrates H1 (challenges exist) and H2 (computation dominates) into a coherent causal model. It agrees with H1 that linguistic structure matters and with H2 that computation is the proximate cause, while adding the causal link between them.