R0027/2026-03-26/Q001/H2
Statement
No meaningful or consistent performance gap has been demonstrated between English and non-English prompt engineering effectiveness. Research either does not exist or shows no reliable performance difference across languages.
Status
Current: Eliminated
H2 is eliminated by overwhelming evidence. Multiple independent benchmarks, controlled experiments, and large-scale evaluations consistently demonstrate measurable performance gaps between English and non-English languages. The volume of research (36+ papers in a recent survey), the consistency of findings across different benchmarks, and the identification of a causal mechanism (tokenization bias) all contradict H2.
Supporting Evidence
No evidence was found supporting H2. No study reported equivalent performance across languages.
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC05-E01 | 30-point gap across 13 languages |
| SRC04-E01 | Consistent performance hierarchy across 17 languages |
| SRC07-E01 | Gaps of 3.5 to 21.2 percentage points for Arabic, Mandarin, Hindi, Telugu |
| SRC06-E01 | 197 experiments confirming English prompt advantage |
| SRC08-E01 | Structural mechanism (tokenization) explains the gap |
| SRC03-E01 | 10+ point gaps documented; prompting techniques developed to address them |
| SRC01-E01 | 36 papers studying the phenomenon confirm that it is recognized |
Reasoning
Every piece of evidence found contradicts H2. The gap is documented across multiple benchmarks, languages, models, and research teams. A structural causal mechanism (tokenization bias) has been identified and quantified. H2 is eliminated with high confidence.
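As a rough illustration of why tokenization can disadvantage non-English text, the sketch below compares average UTF-8 bytes per character across scripts. This is only a proxy: it does not run any real tokenizer and is not the measurement reported in SRC08-E01. The function name `utf8_bytes_per_char` and the sample greetings are hypothetical, chosen for illustration. The underlying assumption is that tokenizers operating on bytes start from longer byte sequences for non-Latin scripts.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings (short greetings), used only for illustration.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Hindi": "नमस्ते",
    "Mandarin": "你好",
}

for language, text in samples.items():
    ratio = utf8_bytes_per_char(text)
    print(f"{language}: {ratio:.1f} bytes per character")
```

ASCII text uses one byte per character, while Arabic letters take two and Devanagari and common CJK characters take three. Actual token counts depend on the specific tokenizer's learned vocabulary, so these ratios indicate the direction of the effect and not its size.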
Relationship to Other Hypotheses
H2 is incompatible with H1 and H3, both of which are supported by the evidence.