R0027/2026-03-26/Q001/SRC04
Huang et al. — BenchMAX 17-language evaluation suite
Source
| Field | Value |
| --- | --- |
| Title | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models |
| Publisher | ICML 2025 |
| Author(s) | Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan |
| Date | 2025-02 |
| URL | https://arxiv.org/html/2502.07346v1 |
| Type | Benchmark paper (peer-reviewed) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Some concerns |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | ICML-accepted benchmark; each sample was post-edited by three native speakers. High methodological rigor. |
| Relevance | Covers 17 languages, including Japanese, Arabic, Korean, and Chinese, across 6 tested capabilities. Directly quantifies cross-language performance gaps. |
| Bias flags | Some measurement concerns: GPT-4o-mini was used to select the final version of each sample, which could introduce model-specific bias. |
Evidence
| Evidence ID | Summary |
| --- | --- |
| SRC04-E01 | High-resource languages consistently outperform low-resource languages; scaling does not close the gap |