R0027/2026-03-26/Q001/H2


Research	R0027 — Multilingual prompt engineering challenges
Run	2026-03-26
Query	Q001
Hypothesis	H2

Statement

No meaningful or consistent gap in prompt engineering effectiveness has been demonstrated between English and non-English languages. Research either does not exist or shows no reliable performance difference across languages.

Status

Current: Eliminated

H2 is eliminated by overwhelming evidence. Multiple independent benchmarks, controlled experiments, and large-scale evaluations consistently demonstrate measurable performance gaps between English and non-English languages. The volume of research (36+ papers in a recent survey), the consistency of findings across different benchmarks, and the identification of a causal mechanism (tokenization bias) all contradict H2.

Supporting Evidence

No evidence was found supporting H2. No study reported equivalent performance across languages.

Contradicting Evidence

Evidence	Summary
SRC05-E01	30-point gap across 13 languages
SRC04-E01	Consistent performance hierarchy across 17 languages
SRC07-E01	3.5-21.2pp gaps for Arabic, Mandarin, Hindi, Telugu
SRC06-E01	197 experiments confirming English prompt advantage
SRC08-E01	Structural mechanism (tokenization) explains the gap
SRC03-E01	10+ point gaps documented, prompting techniques developed to address them
SRC01-E01	36 papers studying the phenomenon confirm it is recognized

Reasoning

Every piece of evidence found contradicts H2. The gap is documented across multiple benchmarks, languages, models, and research teams. A structural causal mechanism (tokenization bias) has been identified and quantified. H2 is eliminated with high confidence.
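
As a rough illustration of why tokenization can act as a structural mechanism, the sketch below compares UTF-8 bytes per character across scripts. This is only a proxy and an assumption on our part: real subword tokenizers merge bytes into larger units, and the sketch does not reproduce the measurement in SRC08-E01. The sample sentences and the function names `bytes_per_char` and `byte_inflation` are hypothetical, chosen for illustration.

```python
# Illustrative sample sentences (assumptions, not data from the cited sources).
SAMPLES = {
    "English": "Hello, how are you?",
    "Arabic": "مرحبا كيف حالك",
    "Hindi": "नमस्ते आप कैसे हैं",
    "Mandarin": "你好，你好吗",
}


def bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


def byte_inflation(samples: dict[str, str], baseline: str = "English") -> dict[str, float]:
    """Return each language's bytes-per-character relative to the baseline language."""
    base = bytes_per_char(samples[baseline])
    return {lang: bytes_per_char(text) / base for lang, text in samples.items()}


if __name__ == "__main__":
    for lang, ratio in byte_inflation(SAMPLES).items():
        print(f"{lang}: {ratio:.2f}x bytes per character vs English")
```

Ratios above 1.0 mean the script needs more bytes than English to encode the same number of characters, which is one plausible starting point for a tokenizer to spend more tokens on non-English text. Measuring the actual effect would require running the specific tokenizer of the model under study.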

Relationship to Other Hypotheses

H2 is incompatible with H1 and H3, both of which are supported by the evidence.