R0028/2026-03-26/C022 — Claim Definition

Claim as Received

Published research documents performance gaps of 3 to 30 percentage points between English and non-English languages, depending on the language and task. Arabic shows the smallest gap (3 points); low-resource languages show the largest (30 points).

Claim as Clarified

Partially correct. Research confirms significant performance gaps between English and non-English languages in LLMs. The LILT analysis found that model limitations drive 72-87% of errors. However, the claim that Arabic shows the smallest gap (3 points) is contradicted by evidence that Arabic requires 3x more tokens than English and that its performance can collapse to much lower levels. The overall range of 3 to 30 percentage points is broadly consistent with documented gaps.
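
When checking a source against this range, note that a gap in percentage points is the absolute difference between two scores on a 0-100 scale, not a relative change. The sketch below illustrates that arithmetic. The function names and the scores are hypothetical, chosen only for illustration; they are not taken from the LILT analysis or any other cited study.

```python
def gap_in_points(english_score: float, other_score: float) -> float:
    """Absolute gap in percentage points between two scores on a 0-100 scale."""
    return english_score - other_score


def within_claimed_range(gap: float, low: float = 3.0, high: float = 30.0) -> bool:
    """True if the gap falls inside the 3 to 30 point range stated in the claim."""
    return low <= gap <= high


# Hypothetical scores, for illustration only (not from any study).
english = 80.0
other = 62.0

gap = gap_in_points(english, other)      # absolute difference, in points
relative_drop = gap / english * 100      # relative drop, in percent (a different quantity)

print(f"gap: {gap:.1f} points, relative drop: {relative_drop:.1f}%")
print("inside claimed range:", within_claimed_range(gap))
```

A source that reports a relative drop rather than a point difference would need converting before it can be compared against the claimed range.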

BLUF

Partially correct. Research confirms significant performance gaps between English and non-English languages in LLMs. The LILT analysis found that model limitations drive 72-87% of errors. However, the claim that Arabic shows the smallest gap (3 points) is contradicted by evidence that Arabic requires 3x more tokens than English and that its performance can collapse to much lower levels. The overall range of 3 to 30 percentage points is broadly consistent with documented gaps.

Scope

  • Domain: Prompt engineering and related fields
  • Timeframe: As of 2026-03-26
  • Testability: Verifiable through primary sources

Assessment Summary

Probability: Likely (55-80%)

Confidence: Medium

Hypothesis outcome: See assessment.md.

[Full assessment in assessment.md.]

Status

Field               Value
Date created        2026-03-26
Date completed      2026-03-26
Researcher profile  None provided
Prompt version      Unified Research Standard v1.0-draft
Revisit by          2027-03-26
Revisit trigger     New evidence or source changes