R0027/2026-03-26/Q001/H1
Statement
Significant, well-documented performance gaps exist between the effectiveness of English and non-English prompt engineering. Published research consistently shows measurable degradation for non-English languages.
Status
Current: Partially supported
H1 is partially supported because the evidence overwhelmingly confirms that performance gaps exist and are well-documented. However, the framing of H1 as a simple, consistent degradation oversimplifies what the evidence shows. The gap is real but its magnitude and direction are conditional on task type, language resource level, model architecture, and prompting strategy. H3 better captures the evidence pattern.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC05-E01 | 30-point English-Swahili gap on MMLU-ProX with clear language hierarchy |
| SRC04-E01 | High-resource languages consistently outperform low-resource; scaling does not close gap |
| SRC07-E01 | Per-language accuracy: English 70.9%, Hindi 63.1%, Mandarin 64.6%, Arabic 67.4% |
| SRC06-E01 | English prompts outperform Arabic even on Arabic-centric models |
| SRC08-E01 | Tokenization fertility predicts accuracy loss: roughly 8-18pp per additional token per word |
| SRC03-E01 | XLT exists to address 10+ point cross-language gaps |
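The tokenization-fertility effect cited in SRC08-E01 (more tokens per word correlating with larger accuracy loss) can be sketched numerically. The tokenizer below is a hypothetical character-bigram stand-in, not any model's real vocabulary; an actual measurement would use the target model's own tokenizer.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: splits each whitespace word into character bigrams.
    A real fertility measurement would use the model's own (e.g. BPE) tokenizer."""
    tokens: list[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


def fertility(text: str) -> float:
    """Tokenization fertility: tokens produced per whitespace-delimited word."""
    words = text.split()
    return len(toy_tokenize(text)) / len(words) if words else 0.0
```

Under this toy scheme, longer or morphologically richer words inflate fertility (e.g. `fertility("the cat sat")` is lower than `fertility("internationalization")`); per SRC08-E01, each additional token per word would predict roughly 8-18pp of accuracy loss.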
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E02 | Native-language prompts outperform English on some tasks (sentiment, coreference); the gap is not universal |
Reasoning
The evidence strongly supports the existence of a performance gap, but the gap is not uniformly "English is better." On certain tasks, native-language prompts perform better. The gap's magnitude varies from near-zero (GPT-4o on Arabic) to 30+ points (low-resource languages on knowledge tasks). This makes H1 partially correct: the gap exists but is more nuanced than a blanket statement of consistent degradation.
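The varying gap magnitude can be made concrete with the per-language accuracies already cited from SRC07-E01. The figures below come directly from that table; the computation simply expresses each language's gap relative to English in percentage points.

```python
# Per-language accuracies (%) reported in SRC07-E01.
accuracy = {"English": 70.9, "Hindi": 63.1, "Mandarin": 64.6, "Arabic": 67.4}

# Gap relative to English, in percentage points, rounded for readability.
gaps = {
    lang: round(accuracy["English"] - acc, 1)
    for lang, acc in accuracy.items()
    if lang != "English"
}
```

This yields gaps of 7.8pp (Hindi), 6.3pp (Mandarin), and 3.5pp (Arabic) on that benchmark, illustrating that even among relatively high-resource languages the degradation spans a meaningful range.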
Relationship to Other Hypotheses
H1 and H3 are compatible: H1 captures the existence of the gap, while H3 captures its conditional nature. H2 (no gap) is eliminated by the evidence.