

Research R0027 — Multilingual prompt engineering challenges
Run 2026-03-26
Query Q001
Search S02
Result S02-R01
Source SRC04

Huang et al. — BenchMAX 17-language evaluation suite

Source

Title: BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Publisher: ICML 2025
Author(s): Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
Date: 2025-02
URL: https://arxiv.org/html/2502.07346v1
Type: Benchmark paper (peer-reviewed)

Summary

Reliability: High
Relevance: High
Bias (missing data): Low risk
Bias (measurement): Some concerns
Bias (selective reporting): Low risk
Bias (randomization): N/A (not an RCT)
Bias (protocol deviation): N/A (not an RCT)
Bias (COI/funding): Low risk

Rationale

Reliability: ICML-accepted benchmark with human post-editing by three native speakers per sample; high methodological rigor.
Relevance: Covers 17 languages, including Japanese, Arabic, Korean, and Chinese, across 6 tested capabilities. Directly quantifies cross-language performance gaps.
Bias flags: Some measurement concerns: GPT-4o-mini was used for final version selection, which could introduce model-specific bias.

Evidence Extracts

SRC04-E01: High-resource languages consistently outperform low-resource languages; scaling does not close the gap.