R0027/2026-03-26/Q001/H2

Statement

No meaningful or consistent performance gap has been demonstrated between English and non-English prompt engineering effectiveness. Research either does not exist or shows no reliable performance difference across languages.

Status

Current: Eliminated

H2 is eliminated by overwhelming evidence. Multiple independent benchmarks, controlled experiments, and large-scale evaluations consistently demonstrate measurable performance gaps between English and non-English languages. The volume of research (36+ papers in a recent survey), the consistency of findings across different benchmarks, and the identification of a causal mechanism (tokenization bias) all contradict H2.

Supporting Evidence

No evidence was found supporting H2. No study reported equivalent performance across languages.

Contradicting Evidence

Evidence | Summary
SRC05-E01 | 30-point gap across 13 languages
SRC04-E01 | Consistent performance hierarchy across 17 languages
SRC07-E01 | Gaps of 3.5 to 21.2 percentage points for Arabic, Mandarin, Hindi, and Telugu
SRC06-E01 | 197 experiments confirm an English-prompt advantage
SRC08-E01 | A structural mechanism (tokenization) explains the gap
SRC03-E01 | Gaps of 10+ points documented; prompting techniques developed to address them
SRC01-E01 | A survey of 36 papers studying the phenomenon confirms that it is widely recognized

Reasoning

Every piece of evidence found contradicts H2. The gap is documented across multiple benchmarks, languages, models, and research teams. A structural causal mechanism (tokenization bias) has been identified and quantified. H2 is eliminated with high confidence.

Relationship to Other Hypotheses

H2 is incompatible with H1 and H3, both of which the evidence supports.