R0027/2026-03-26/Q001/S02/R05
Quantification of tokenization bias as a systematic tax on non-English languages
Summary
| Field | Value |
|---|---|
| Title | The Token Tax: Systematic Bias in Multilingual Tokenization |
| URL | https://arxiv.org/html/2509.05486v1 |
| Date accessed | 2026-03-26 |
| Publication date | 2025-09 |
| Author(s) | Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll |
| Publication | arXiv preprint |
Selection Decision
Included in evidence base: Yes
Rationale: Quantifies tokenization as the structural mechanism behind performance degradation in non-English languages. Reports that each additional token per word reduces accuracy by 8-18 percentage points. Critical for understanding the root causes of that degradation.
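To make the 8-18 percentage point figure concrete, the sketch below computes tokens per word for a text and scales the reported range by the extra tokens per word relative to a baseline. It is an illustration only: the function names, the sample counts, and the assumption that the effect scales linearly beyond one additional token per word are hypothetical and are not taken from the paper.

```python
def tokens_per_word(token_count: int, word_count: int) -> float:
    """Average number of tokens a tokenizer emits per word."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def accuracy_drop_range(
    extra_tokens_per_word: float,
    low_pp: float = 8.0,   # lower bound quoted in the rationale
    high_pp: float = 18.0, # upper bound quoted in the rationale
) -> tuple[float, float]:
    """Scale the reported per-token range by the extra tokens per word.

    Assumption: the drop grows linearly with each additional token per
    word. This is a simplification made for illustration.
    """
    return (extra_tokens_per_word * low_pp, extra_tokens_per_word * high_pp)


# Hypothetical counts: the same 10-word sentence tokenized in two languages.
baseline = tokens_per_word(token_count=15, word_count=10)  # 1.5 tokens/word
target = tokens_per_word(token_count=30, word_count=10)    # 3.0 tokens/word

low, high = accuracy_drop_range(target - baseline)
print(f"Extra tokens per word: {target - baseline:.1f}")
print(f"Implied accuracy drop: {low:.1f} to {high:.1f} percentage points")
```

Under these made-up counts, a language needing 1.5 more tokens per word than the baseline would sit 12 to 27 percentage points lower if the linear assumption held. The real relationship should be read from the paper itself.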