

Research R0027: Multilingual prompt engineering challenges
Run: 2026-03-26
Query: Q001
Search: S02
Result: S02-R05
Source: SRC08

Lundin et al. — The Token Tax: tokenization bias in multilingual LLMs

Source

Title: The Token Tax: Systematic Bias in Multilingual Tokenization
Publisher: arXiv
Author(s): Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll
Date: 2025-09
URL: https://arxiv.org/html/2509.05486v1
Type: Research paper

Summary

Reliability: High
Relevance: Medium-High
Bias (missing data): Some concerns
Bias (measurement): Low risk
Bias (selective reporting): Low risk
Bias (randomization): N/A (not an RCT)
Bias (protocol deviation): N/A (not an RCT)
Bias (COI/funding): Low risk

Rationale

Reliability: Quantitative analysis of 10 LLMs across 16 African languages, with a clear causal mechanism (tokenization overhead) identified.
Relevance: Explains the structural root cause of cross-language performance gaps: tokenization. Rated slightly lower because the study focuses specifically on African languages.
Bias flags: Some concerns about missing data. The study covers African languages only, so its findings may not fully generalize to the Asian and Middle Eastern languages asked about in Q001.

Evidence Extracts

SRC08-E01: Each additional token per word reduces accuracy by 8-18 percentage points; token inflation also imposes a 4x cost multiplier under per-token API pricing.
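
To make the "tokens per word" (fertility) metric behind SRC08-E01 concrete, below is a minimal sketch that measures it with a Hugging Face tokenizer. The tokenizer name ("gpt2") and the parallel sentences are illustrative assumptions, not drawn from the paper, which evaluates its own set of models and African-language benchmarks.

```python
from transformers import AutoTokenizer

# Load a subword tokenizer; "gpt2" is an illustrative choice, not one of
# the models studied in the paper.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical parallel sentences with the same meaning in both languages.
samples = {
    "English": "The weather is nice today.",
    "Swahili": "Hali ya hewa ni nzuri leo.",
}

for lang, text in samples.items():
    words = text.split()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    fertility = len(tokens) / len(words)  # tokens per word
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
          f"-> fertility {fertility:.2f}")
```

Because commercial APIs bill per token, a language tokenized at several times English's fertility pays proportionally more for the same content, which matches the cost-multiplier mechanism described in E01.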