R0027/2026-03-26/Q001/SRC05
Xuan et al. — MMLU-ProX multilingual benchmark
Source
| Field | Value |
| --- | --- |
| Title | MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation |
| Publisher | arXiv |
| Author(s) | Weihao Xuan, Rui Yang, et al. |
| Date | 2025-03-13 |
| URL | https://arxiv.org/html/2503.10497v1 |
| Type | Benchmark paper |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Some concerns |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A — not an RCT |
| Bias: Protocol deviation | N/A — not an RCT |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | 11,829 identical questions per language enable direct cross-language comparison. Semi-automatic translation with expert verification. 25 models tested. |
| Relevance | Covers Japanese, Chinese, Korean, Arabic, Hindi — all languages named in Q001. Provides the most precise cross-language performance data. |
| Bias flags | Measurement concerns: a translation-based benchmark may introduce artifacts, so performance gaps could partly reflect translation quality rather than model capability. |
| Evidence ID | Summary |
| --- | --- |
| SRC05-E01 | 30-point English-Swahili gap; clear performance hierarchy across language families |