R0028/2026-03-26/C022

Claim: Published research documents performance gaps of 3 to 30 percentage points between English and non-English languages, depending on the language and task. Arabic shows the smallest gap (3 points); low-resource languages show the largest (30 points).

BLUF: Partially correct. Research confirms significant performance gaps between English and non-English languages in LLMs, and the 3-30 percentage point range is broadly consistent with documented gaps. The LILT analysis found that model limitations drive 72-87% of errors. However, the specific claim that Arabic shows the smallest gap (3 points) is contradicted by evidence that Arabic requires 3x more tokens than English and can collapse to much lower performance.

Probability: Likely (55-80%) | Confidence: Medium

Correction needed: The characterization of Arabic as showing the 'smallest gap' is contradicted by evidence that Arabic requires 3x more tokens than English and sometimes collapses to significantly lower accuracy.


Summary

Entity | Description
Claim Definition | Claim text, scope, status
Assessment | Full analytical product with reasoning chain
ACH Matrix | Evidence x hypotheses diagnosticity analysis
Self-Audit | ROBIS-adapted 4-domain process audit

Hypotheses

ID | Hypothesis | Status
H1 | Claim is accurate, including Arabic having the smallest gap | Inconclusive
H2 | Performance gaps are real and in the documented range, but Arabic having the smallest gap is not supported | Supported
H3 | Claim is materially wrong | Eliminated

Searches

ID | Target | Results | Selected
S01 | Primary search | 10 | 3

Sources

Source | Description | Reliability | Relevance
SRC01 | LILT Multilingual LLM Performance Gap Analysis | High | High