R0027/2026-03-26/Q002/S02
WebSearch — Tokenization bias and cost for non-English languages with different structures
Summary
| Field | Value |
|---|---|
| Source/Database | WebSearch |
| Query terms | "tokenization bias non-English languages LLM Arabic Japanese Chinese cost tokens per word 2024 2025" |
| Filters | None |
| Results returned | 10 |
| Results selected | 5 |
| Results rejected | 5 |
Selected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | The Token Tax: Systematic Bias in Multilingual Tokenization | https://arxiv.org/html/2509.05486v1 | Quantifies tokenization cost per language structure |
| S02-R02 | Language Model Tokenizers Introduce Unfairness Between Languages | https://arxiv.org/pdf/2305.15425 | Foundational paper on tokenizer unfairness |
| S02-R03 | Multilingual Tokenization Advances | https://www.emergentmind.com/topics/multilingual-tokenization | Overview of tokenization challenges and solutions |
| S02-R04 | Why LLM Performance Drops in Non-English Languages (LILT) | https://lilt.com/blog/multilingual-llm-performance-gap-analysis | Root cause analysis of performance drops |
| S02-R05 | Problematic Tokens: Tokenizer Bias in Large Language Models | https://arxiv.org/html/2406.11214 | Specific problematic token patterns |
Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R06 | Tokenization efficiency for Ukrainian language | https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full | Single-language focus, Ukrainian not asked about |
| S02-R07 | Do All Languages Cost the Same? (ACL) | https://aclanthology.org/2023.emnlp-main.614.pdf | Relevant but could not access full text; findings captured via other sources |
| S02-R08 | Understanding Tokenization in LLMs (Medium) | https://medium.com/@jhoansfuentes1999/understanding-tokenization-in-llms | General tutorial, not research |
| S02-R09 | What is Tokenization in AI? (AI21) | https://www.ai21.com/knowledge/tokenization/ | Commercial explainer, not research |
| S02-R10 | LLM tokens and foreign languages (blog) | https://ikriv.com/blog/?p=5322 | Personal blog, anecdotal |
Notes
Tokenization research is more mature than research on prompt engineering tailored to specific linguistic structures. The evidence points strongly to tokenization as the primary mediator of linguistic-structure effects.