R0027/2026-03-26/Q001/H1
Statement
Significant, well-documented performance gaps exist between the effectiveness of English and non-English prompt engineering. Published research consistently shows measurable degradation for non-English languages.
Status
Current: Partially supported
H1 is partially supported because the evidence overwhelmingly confirms that performance gaps exist and are well-documented. However, the framing of H1 as a simple, consistent degradation oversimplifies what the evidence shows. The gap is real but its magnitude and direction are conditional on task type, language resource level, model architecture, and prompting strategy. H3 better captures the evidence pattern.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC05-E01 | 30-point English-Swahili gap on MMLU-ProX with clear language hierarchy |
| SRC04-E01 | High-resource languages consistently outperform low-resource; scaling does not close gap |
| SRC07-E01 | Per-language accuracy: English 70.9%, Hindi 63.1%, Mandarin 64.6%, Arabic 67.4% |
| SRC06-E01 | English prompts outperform Arabic even on Arabic-centric models |
| SRC08-E01 | Tokenization fertility predicts accuracy loss: roughly 8-18pp per additional token per word |
| SRC03-E01 | XLT exists to address 10+ point cross-language gaps |
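The tokenization-fertility effect cited in SRC08-E01 (more tokens per word correlating with larger accuracy loss) can be sketched numerically. The tokenizer below is a hypothetical character-bigram stand-in, not any model's real vocabulary; an actual measurement would use the target model's own tokenizer.

```python
def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: splits each whitespace word into character bigrams.
    A real fertility measurement would use the model's own (e.g. BPE) tokenizer."""
    tokens: list[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


def fertility(text: str) -> float:
    """Tokenization fertility: tokens produced per whitespace-delimited word."""
    words = text.split()
    return len(toy_tokenize(text)) / len(words) if words else 0.0
```

Under this toy scheme, longer or morphologically richer words inflate fertility (e.g. `fertility("the cat sat")` is lower than `fertility("internationalization")`); per SRC08-E01, each additional token per word would predict roughly 8-18pp of accuracy loss.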
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E02 | Native-language prompts outperform English on some tasks (sentiment, coreference); the gap is not universal |
Reasoning
The evidence strongly supports the existence of a performance gap, but the gap is not uniformly "English is better." On certain tasks, native-language prompts perform better. The gap's magnitude varies from near-zero (GPT-4o on Arabic) to 30+ points (low-resource languages on knowledge tasks). This makes H1 partially correct: the gap exists but is more nuanced than a blanket statement of consistent degradation.
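The varying gap magnitude can be made concrete with the per-language accuracies already cited from SRC07-E01. The figures below come directly from that table; the computation simply expresses each language's gap relative to English in percentage points.

```python
# Per-language accuracies (%) reported in SRC07-E01.
accuracy = {"English": 70.9, "Hindi": 63.1, "Mandarin": 64.6, "Arabic": 67.4}

# Gap relative to English, in percentage points, rounded for readability.
gaps = {
    lang: round(accuracy["English"] - acc, 1)
    for lang, acc in accuracy.items()
    if lang != "English"
}
```

This yields gaps of 7.8pp (Hindi), 6.3pp (Mandarin), and 3.5pp (Arabic) on that benchmark, illustrating that even among relatively high-resource languages the degradation spans a meaningful range.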
Relationship to Other Hypotheses
H1 and H3 are compatible: H1 captures the existence of the gap, while H3 captures its conditional nature. H2 (no gap) is eliminated by the evidence.