R0027/2026-03-26/Q001/SRC08
Lundin et al. — The Token Tax: tokenization bias in multilingual LLMs
Source
| Field | Value |
| --- | --- |
| Title | The Token Tax: Systematic Bias in Multilingual Tokenization |
| Publisher | arXiv |
| Author(s) | Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll |
| Date | 2025-09 |
| URL | https://arxiv.org/html/2509.05486v1 |
| Type | Research paper |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | Medium-High |
| Bias: Missing data | Some concerns |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Quantitative analysis of 10 LLMs across 16 African languages; a clear causal mechanism (tokenization) is identified. |
| Relevance | Explains the structural root cause of cross-language performance gaps (tokenization). Rated slightly lower because the paper focuses specifically on African languages. |
| Bias flags | Some concerns about missing data: the focus on African languages may not fully generalize to the Asian and Middle Eastern languages asked about in Q001. |
Evidence
| Evidence ID | Summary |
| --- | --- |
| SRC08-E01 | Each additional token per word reduces accuracy by 8-18 percentage points; 4x cost multiplier |
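The "token tax" mechanism behind SRC08-E01 can be sketched as a fertility (tokens per word) and cost calculation. This is an illustrative sketch only: the function names and the sample token counts are made up for demonstration and are not taken from the paper; any subword tokenizer's output lengths could be plugged in.

```python
def fertility(total_tokens: int, total_words: int) -> float:
    """Tokenization fertility: average tokens produced per word."""
    return total_tokens / total_words

def cost_multiplier(lang_fertility: float, baseline_fertility: float) -> float:
    """Relative inference/API cost: billed tokens scale linearly with fertility."""
    return lang_fertility / baseline_fertility

# Illustrative (made-up) counts: a 10-word English sentence tokenizing to
# 12 subword tokens vs. a 10-word low-resource-language sentence
# tokenizing to 48 subword tokens.
en = fertility(12, 10)   # 1.2 tokens/word
lr = fertility(48, 10)   # 4.8 tokens/word

# Same text length in words, roughly 4x the billed tokens: the shape of
# the "4x cost multiplier" finding summarized above.
print(cost_multiplier(lr, en))
```

Under this framing, the accuracy finding reads as a per-unit penalty: each extra token of fertility beyond the baseline costs 8-18 accuracy points, while cost grows linearly with fertility.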