R0027/2026-03-26/Q001/H3


Research	R0027 — Multilingual prompt engineering challenges
Run	2026-03-26
Query	Q001
Hypothesis	H3

Statement

Performance gaps between English and non-English prompts exist but are highly conditional: they vary significantly by language resource level, task type, model architecture, and prompting strategy. The relationship between prompt language and effectiveness is not a simple hierarchy but a multi-dimensional interaction.

Status

Current: Supported

H3 is the best-supported hypothesis. The evidence consistently shows that while performance gaps are real, they are shaped by at least four interacting factors: (1) language resource level in training data, (2) task type (extractive vs generative, reasoning vs understanding), (3) model architecture and scale, and (4) prompting strategy (native, translated, selective, cross-lingual). No single factor explains the full pattern.

Supporting Evidence

Evidence	Summary
SRC02-E01	Selective pre-translation outperforms both full translation and native prompting; optimal strategy varies by task
SRC02-E02	200%+ improvement for low-resource languages with the right strategy; language resource level matters
SRC01-E02	Native prompts outperform English on sentiment/coreference; task type matters
SRC06-E01	GPT-4o shows minimal gap while Jais struggles; model matters
SRC08-E01	Reasoning models narrow gap by 8-12 points; model type matters
SRC03-E01	XLT reduces gap by 10+ points; prompting strategy matters
SRC05-E01	Performance hierarchy varies: European > East Asian > South Asian; resource level matters
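
As a rough check that the table above covers all four factors, the rows can be tallied by the factor each summary names. This is only an illustrative sketch: the `EVIDENCE_FACTORS` mapping and the `factor_counts` helper are hypothetical names, not part of the research pipeline, and tagging SRC02-E01 with both "prompting strategy" and "task type" is an interpretation of its summary rather than a label taken from the source.

```python
from collections import Counter

# Evidence IDs mapped to the factor(s) named in each summary row.
# Assumption: SRC02-E01 is tagged with two factors because its summary
# compares prompting strategies and says the optimum varies by task.
EVIDENCE_FACTORS = {
    "SRC02-E01": ["prompting strategy", "task type"],
    "SRC02-E02": ["language resource level"],
    "SRC01-E02": ["task type"],
    "SRC06-E01": ["model"],
    "SRC08-E01": ["model"],
    "SRC03-E01": ["prompting strategy"],
    "SRC05-E01": ["language resource level"],
}


def factor_counts(evidence):
    """Count how many evidence items mention each factor."""
    counts = Counter()
    for factors in evidence.values():
        counts.update(factors)
    return counts


counts = factor_counts(EVIDENCE_FACTORS)
for factor, n in sorted(counts.items()):
    print(f"{factor}: {n} evidence item(s)")
```

Under this tagging, each of the four factors is backed by more than one evidence item, which is consistent with the claim that no single factor explains the full pattern.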

Contradicting Evidence

No evidence directly contradicts H3. All evidence is consistent with a conditional, multi-factor relationship.

Reasoning

H3 is the most faithful representation of the evidence. The performance gap between English and non-English prompts is real (consistent with H1) but cannot be described by a single number or simple rule. The optimal prompting strategy for a given language depends on the task type, the model being used, and the language's resource level. This conditionality is the central finding of the research literature.

Relationship to Other Hypotheses

H3 subsumes H1: both agree the gap exists, but H3 adds the crucial qualifier that it is conditional. H2 is eliminated. H3 is the preferred hypothesis.