R0027/2026-03-26/Q001/H1

Statement

Significant, well-documented performance gaps exist between prompt engineering effectiveness in English and in non-English languages. Published research consistently shows measurable degradation for non-English languages.

Status

Current: Partially supported

H1 is partially supported because the evidence overwhelmingly confirms that performance gaps exist and are well-documented. However, the framing of H1 as a simple, consistent degradation oversimplifies what the evidence shows. The gap is real but its magnitude and direction are conditional on task type, language resource level, model architecture, and prompting strategy. H3 better captures the evidence pattern.

Supporting Evidence

Evidence     Summary
SRC05-E01    30-point English-Swahili gap on MMLU-ProX, with a clear language hierarchy
SRC04-E01    High-resource languages consistently outperform low-resource ones; scaling does not close the gap
SRC07-E01    Per-language accuracy: English 70.9%, Hindi 63.1%, Mandarin 64.6%, Arabic 67.4%
SRC06-E01    English prompts outperform Arabic prompts even on Arabic-centric models
SRC08-E01    Tokenization fertility predicts accuracy loss: 8-18pp per additional token per word (see the sketch after this table)
SRC03-E01    XLT (cross-lingual-thought prompting) exists to address 10+ point cross-language gaps
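
For context on SRC08-E01: tokenization fertility is the average number of subword tokens a model's tokenizer emits per word of input. Below is a minimal sketch of the metric, assuming the Hugging Face transformers AutoTokenizer API; the multilingual BERT tokenizer is illustrative, not the one used in the cited source.

```python
# Minimal sketch of tokenization fertility (SRC08-E01's predictor):
# subword tokens emitted per word of input.
# Assumption: Hugging Face `transformers` is installed; the tokenizer
# choice is illustrative, not taken from the cited source.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(text: str) -> float:
    """Subword tokens per whitespace-delimited word.

    Whitespace splitting is a simplification; unsegmented scripts
    such as Mandarin need a real word segmenter for the denominator.
    """
    words = text.split()
    return len(tokenizer.tokenize(text)) / len(words)

print(fertility("The quick brown fox jumps over the lazy dog"))
print(fertility("Haraka haraka haina baraka"))  # Swahili proverb
```

On SRC08-E01's reading, each additional token per word costs roughly 8-18 percentage points of accuracy, so languages that tokenize far less efficiently than English start at a built-in disadvantage.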

Contradicting Evidence

Evidence     Summary
SRC01-E02    Native prompts outperform English on some tasks (sentiment, coreference); the gap is not universal

Reasoning

The evidence strongly supports the existence of a performance gap, but the gap is not uniformly "English is better." On certain tasks, native-language prompts perform better. The gap magnitude varies from near-zero (GPT-4o on Arabic) to 30+ points (low-resource languages on knowledge tasks). This makes H1 partially correct — the gap exists but is more nuanced than a blanket statement.
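
To make the spread concrete, the English-relative gaps implied by the SRC07-E01 accuracies can be computed directly; this is a worked illustration of the cited numbers, not new data.

```python
# English-relative accuracy gaps (percentage points) from SRC07-E01.
accuracy = {"English": 70.9, "Hindi": 63.1, "Mandarin": 64.6, "Arabic": 67.4}

gaps = {lang: round(accuracy["English"] - acc, 1)
        for lang, acc in accuracy.items() if lang != "English"}
print(gaps)  # {'Hindi': 7.8, 'Mandarin': 6.3, 'Arabic': 3.5}
```

Even within this single evaluation the gap spans 3.5 to 7.8 points, while SRC05-E01's English-Swahili result pushes it past 30; that spread is the conditionality H3 describes.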

Relationship to Other Hypotheses

H1 and H3 are compatible — H1 captures the existence of the gap, while H3 captures its conditional nature. H2 (no gap) is eliminated by the evidence.