R0027/2026-03-26/Q002/S02

WebSearch — Tokenization bias and cost for non-English languages with different structures

Summary

| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | "tokenization bias non-English languages LLM Arabic Japanese Chinese cost tokens per word 2024 2025" |
| Filters | None |
| Results returned | 10 |
| Results selected | 5 |
| Results rejected | 5 |

Selected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R01 | The Token Tax: Systematic Bias in Multilingual Tokenization | https://arxiv.org/html/2509.05486v1 | Quantifies tokenization cost per language structure |
| S02-R02 | Language Model Tokenizers Introduce Unfairness Between Languages | https://arxiv.org/pdf/2305.15425 | Foundational paper on tokenizer unfairness |
| S02-R03 | Multilingual Tokenization Advances | https://www.emergentmind.com/topics/multilingual-tokenization | Overview of tokenization challenges and solutions |
| S02-R04 | Why LLM Performance Drops in Non-English Languages (LILT) | https://lilt.com/blog/multilingual-llm-performance-gap-analysis | Root cause analysis of performance drops |
| S02-R05 | Problematic Tokens: Tokenizer Bias in Large Language Models | https://arxiv.org/html/2406.11214 | Specific problematic token patterns |

Rejected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R06 | Tokenization efficiency for Ukrainian language | https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full | Single-language focus, Ukrainian not asked about |
| S02-R07 | Do All Languages Cost the Same? (ACL) | https://aclanthology.org/2023.emnlp-main.614.pdf | Relevant but could not access full text; findings captured via other sources |
| S02-R08 | Understanding Tokenization in LLMs (Medium) | https://medium.com/@jhoansfuentes1999/understanding-tokenization-in-llms | General tutorial, not research |
| S02-R09 | What is Tokenization in AI? (AI21) | https://www.ai21.com/knowledge/tokenization/ | Commercial explainer, not research |
| S02-R10 | LLM tokens and foreign languages (blog) | https://ikriv.com/blog/?p=5322 | Personal blog, anecdotal |

Notes

Research on tokenization is more mature than research on prompt engineering tailored to specific linguistic structures. The evidence points strongly to tokenization as the primary mediator of linguistic-structure effects.