Research R0027 — Multilingual prompt engineering challenges
Run 2026-03-26
Query Q001
Search S02
Result S02-R02
Source SRC05

Xuan et al. — MMLU-ProX multilingual benchmark

Source

Title: MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Publisher: arXiv
Author(s): Weihao Xuan, Rui Yang, et al.
Date: 2025-03-13
URL: https://arxiv.org/html/2503.10497v1
Type: Benchmark paper

Summary

Reliability: High
Relevance: High
Bias: Missing data: Low risk
Bias: Measurement: Some concerns
Bias: Selective reporting: Low risk
Bias: Randomization: N/A — not an RCT
Bias: Protocol deviation: N/A — not an RCT
Bias: COI/Funding: Low risk

Rationale

Reliability: 11,829 identical questions per language enable direct cross-language comparison. Semi-automatic translation with expert verification. 25 models tested.
Relevance: Covers Japanese, Chinese, Korean, Arabic, and Hindi — all languages named in Q001. Provides the most precise cross-language performance data available for these languages.
Bias flags: Measurement concerns — a translation-based benchmark may introduce artifacts, so performance gaps could partly reflect translation quality rather than model capability.

Evidence Extracts

SRC05-E01: 30-point English-Swahili gap; clear performance hierarchy across language families.