R0027/2026-03-26/Q001/S02
WebSearch — LLM benchmark performance comparisons across non-English languages
Summary
| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | "LLM performance benchmark non-English languages Japanese Mandarin Arabic Hindi prompt compliance accuracy" |
| Filters | None |
| Results returned | 10 |
| Results selected | 5 |
| Results rejected | 5 |
Selected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R01 | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | https://arxiv.org/html/2502.07346v1 | 17-language benchmark with quantified performance gaps across 6 capabilities |
| S02-R02 | MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation | https://arxiv.org/html/2503.10497v1 | 13-language benchmark with 11,829 questions per language, enabling direct cross-linguistic comparison (see sketch below) |
| S02-R03 | Native vs Non-Native Language Prompting: A Comparative Analysis | https://arxiv.org/html/2409.07054v1 | Direct comparison of native (Arabic) vs English prompts across 197 experiments |
| S02-R04 | Multilingual Performance Biases of Large Language Models in Education | https://arxiv.org/html/2504.17720v2 | 9-language educational-task evaluation with per-language accuracy data |
| S02-R05 | The Token Tax: Systematic Bias in Multilingual Tokenization | https://arxiv.org/html/2509.05486v1 | Quantifies tokenization cost disparities and their impact on accuracy |
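A minimal sketch of the cross-linguistic comparison that S02-R01 and S02-R02 enable: because every language sees the same item set, per-language accuracies can be compared directly against an English baseline. The scores below are hypothetical placeholders, not numbers from either paper.

```python
# Hypothetical per-language accuracies on a shared item set (placeholder
# values, NOT results from BenchMAX or MMLU-ProX).
ACCURACY = {
    "English": 0.71,
    "Mandarin": 0.66,
    "Japanese": 0.64,
    "Arabic": 0.58,
    "Hindi": 0.55,
}

baseline = ACCURACY["English"]
for lang, acc in sorted(ACCURACY.items(), key=lambda kv: -kv[1]):
    # Absolute gap vs the English baseline, in accuracy points; a negative
    # value would mean the language outperforms English on this item set.
    print(f"{lang:9s} acc={acc:.2f} gap={baseline - acc:+.2f}")
```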
Rejected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R06 | Multilingual Evaluations in LLMs — a comparison (Medium) | https://medium.com/@vbsowmya/multilingual-evaluations-in-llms-a-comparison-1d58b0fd9848 | Blog post summarizing others' work, not primary research |
| S02-R07 | awesome-multilingual-llm-benchmarks (GitHub) | https://github.com/NaiveNeuron/awesome-multilingual-llm-benchmarks | Curated link list, not primary research |
| S02-R08 | Artificial Analysis Multilingual Model Benchmark | https://artificialanalysis.ai/models/multilingual | Commercial benchmark-comparison tool with limited methodology transparency |
| S02-R09 | List of Best LLMs for Translation (Crowdin) | https://crowdin.com/blog/best-llms-for-translation | Focused on translation quality, not prompt-engineering effectiveness |
| S02-R10 | Open Leaderboard for Japanese LLMs (HuggingFace) | https://huggingface.co/blog/leaderboard-japanese | Japanese-only leaderboard with limited cross-language comparison |
Notes
Excellent benchmark data found. MMLU-ProX and BenchMAX provide the most rigorous cross-language comparisons. The Token Tax paper adds an important structural dimension to the performance-gap story (see the sketch below).
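A minimal sketch of the kind of tokenization-cost measurement the Token Tax paper (S02-R05) is concerned with, assuming the Hugging Face `transformers` library and using the GPT-2 byte-level BPE tokenizer as a stand-in. The parallel sentences are illustrative examples, not data from the paper; the "premium" ratio shows the structural cost a language pays relative to English for the same content.

```python
# Compare how many tokens a subword tokenizer spends on parallel sentences
# in different languages. Languages with higher token counts for the same
# content get less effective context and higher per-request cost.
from transformers import AutoTokenizer

# GPT-2's byte-level BPE handles any Unicode text; any subword tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Parallel sentences (same meaning) -- illustrative examples, not paper data.
parallel = {
    "English":  "The weather is nice today.",
    "Japanese": "今日は天気がいいですね。",
    "Arabic":   "الطقس جميل اليوم.",
    "Hindi":    "आज मौसम अच्छा है।",
}

baseline = len(tokenizer.encode(parallel["English"]))
for lang, text in parallel.items():
    n_tokens = len(tokenizer.encode(text))
    # Premium over English: >1.0x means the language pays more tokens
    # for the same content.
    print(f"{lang:9s} tokens={n_tokens:3d} premium={n_tokens / baseline:.2f}x")
```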