R0027/2026-03-26/Q001/S02

WebSearch — LLM benchmark performance comparisons across non-English languages

Summary

| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | "LLM performance benchmark non-English languages Japanese Mandarin Arabic Hindi prompt compliance accuracy" |
| Filters | None |
| Results returned | 10 |
| Results selected | 5 |
| Results rejected | 5 |

Selected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R01 | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | https://arxiv.org/html/2502.07346v1 | 17-language benchmark with quantified performance gaps across 6 capabilities |
| S02-R02 | MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation | https://arxiv.org/html/2503.10497v1 | 13-language benchmark with 11,829 questions per language, enabling direct cross-linguistic comparison |
| S02-R03 | Native vs Non-Native Language Prompting: A Comparative Analysis | https://arxiv.org/html/2409.07054v1 | Direct comparison of native (Arabic) vs English prompts across 197 experiments |
| S02-R04 | Multilingual Performance Biases of Large Language Models in Education | https://arxiv.org/html/2504.17720v2 | 9-language educational task evaluation with per-language accuracy data |
| S02-R05 | The Token Tax: Systematic Bias in Multilingual Tokenization | https://arxiv.org/html/2509.05486v1 | Quantifies tokenization cost disparities and their impact on accuracy |

Rejected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S02-R06 | Multilingual Evaluations in LLMs — a comparison (Medium) | https://medium.com/@vbsowmya/multilingual-evaluations-in-llms-a-comparison-1d58b0fd9848 | Blog post summarizing others' work, not primary research |
| S02-R07 | awesome-multilingual-llm-benchmarks (GitHub) | https://github.com/NaiveNeuron/awesome-multilingual-llm-benchmarks | Curated link list, not primary research |
| S02-R08 | Artificial Analysis Multilingual Model Benchmark | https://artificialanalysis.ai/models/multilingual | Commercial benchmark comparison tool with limited methodology transparency |
| S02-R09 | List of Best LLMs for Translation (Crowdin) | https://crowdin.com/blog/best-llms-for-translation | Translation-focused, not about prompt engineering effectiveness |
| S02-R10 | Open Leaderboard for Japanese LLMs (HuggingFace) | https://huggingface.co/blog/leaderboard-japanese | Japanese-only leaderboard, offering limited cross-language comparison |

Notes

Excellent benchmark data found. MMLU-ProX and BenchMAX provide the most rigorous cross-language comparisons. The Token Tax paper adds an important structural dimension to the performance-gap story: tokenization cost disparities, not just training-data imbalance, can drive the gap.