C022


Research	R0028 — Prompt Engineering Claims
Run	2026-03-26
Claim	C022

Claim: Published research documents performance gaps of 3 to 30 percentage points between English and non-English languages, depending on the language and task. Arabic shows the smallest gap (3 points); low-resource languages show the largest (30 points).

BLUF: Partially correct. Research confirms significant performance gaps between English and non-English languages in LLMs, and the 3-30 percentage-point range is broadly consistent with documented gaps. The LILT analysis found that model limitations drive 72-87% of errors. However, the specific claim that Arabic shows the smallest gap (3 points) is contradicted by evidence that Arabic requires 3x more tokens than English and that its performance can collapse to much lower levels.

Probability: Likely (55-80%) | Confidence: Medium

Correction needed: The characterization of Arabic as showing the 'smallest gap' conflicts with evidence that Arabic requires 3x more tokens than English and sometimes collapses to significantly lower accuracy.
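The claim and the correction above rely on two different kinds of measurement: a gap in percentage points (a difference between two accuracy scores) and a token ratio (a quotient of two token counts). The following is a minimal sketch of how each could be computed. The function names and all numeric inputs are hypothetical placeholders chosen for illustration; none of them are taken from the LILT analysis or any other source.

```python
def percentage_point_gap(english_acc: float, other_acc: float) -> float:
    """Gap in percentage points: the difference between two accuracies given in percent."""
    return english_acc - other_acc


def token_ratio(other_tokens: int, english_tokens: int) -> float:
    """How many times more tokens a text needs in another language than in English."""
    return other_tokens / english_tokens


def in_claimed_range(gap: float, low: float = 3.0, high: float = 30.0) -> bool:
    """Check whether a gap falls inside the 3 to 30 point range stated in the claim."""
    return low <= gap <= high


# Hypothetical placeholder values, NOT measured results.
example_gap = percentage_point_gap(english_acc=80.0, other_acc=77.0)
example_ratio = token_ratio(other_tokens=300, english_tokens=100)

print(f"gap: {example_gap} percentage points, in claimed range: {in_claimed_range(example_gap)}")
print(f"token ratio: {example_ratio}x")
```

Because the two quantities use different units, a source reporting one of them does not by itself give a value for the other.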

Summary

Entity	Description
Claim Definition	Claim text, scope, status
Assessment	Full analytical product with reasoning chain
ACH Matrix	Evidence x hypotheses diagnosticity analysis
Self-Audit	ROBIS-adapted 4-domain process audit

Hypotheses

ID	Hypothesis	Status
H1	Claim is accurate including Arabic having smallest gap	Inconclusive
H2	Performance gaps are real and in the documented range, but Arabic having the smallest gap is not supported	Supported
H3	Claim is materially wrong	Eliminated

Searches

ID	Target	Results	Selected
S01	Primary search	10	3

Sources

Source	Description	Reliability	Relevance
SRC01	LILT Multilingual LLM Performance Gap Analysis	High	High