# R0027/2026-03-26/Q001 — Assessment
## BLUF
Prompt engineering effectiveness varies significantly across languages, with extensive published research documenting performance gaps of 3-30 percentage points between English and non-English languages. The gap is not uniform — it depends on language resource level, task type, model architecture, and prompting strategy. Tokenization inefficiency is a primary structural cause.
## Probability
Rating: Almost certain (95-99%)
Confidence in assessment: High
Confidence rationale: Eight independent sources from multiple research teams, institutions, and countries converge on the same finding. Multiple benchmarks (MMLU-ProX, BenchMAX, educational tasks, Arabic-specific studies) show consistent performance hierarchies. A causal mechanism (tokenization) has been identified and quantified.
## Reasoning Chain
- A comprehensive survey [SRC01-E01, Medium-High reliability, High relevance] covers 36 papers studying multilingual prompt engineering, establishing this as a well-researched area.
- Benchmark studies quantify the gap: MMLU-ProX shows a 30-point English-Swahili gap [SRC05-E01, High reliability, High relevance]; BenchMAX confirms high-resource languages consistently outperform low-resource [SRC04-E01, High reliability, High relevance].
- For the specific languages named in Q001: Hindi shows a 7.8pp gap, Mandarin 6.3pp, Arabic 3.5pp relative to English [SRC07-E01, High reliability, High relevance].
- Prompt language itself matters: English prompts outperform prompts translated into the task language (72.7% vs. 67.2% accuracy) [SRC07-E02, High reliability, High relevance]. Even Arabic-centric models perform better with English prompts [SRC06-E01, Medium-High reliability, High relevance].
- The gap is conditional: selective pre-translation outperforms both full translation and native prompts [SRC02-E01, High reliability, High relevance]. Native prompts outperform English on sentiment and coreference tasks [SRC01-E02, Medium-High reliability, High relevance].
- A structural cause has been identified: tokenization fertility predicts accuracy, with each additional token per word reducing accuracy by 8-18pp [SRC08-E01, High reliability, Medium-High relevance].
- The gap can be mitigated: XLT prompting reduces it by 10+ points [SRC03-E01, High reliability, High relevance]. Reasoning models narrow it by 8-12 points [SRC08-E01].
- JUDGMENT: The evidence supports H3 (conditional gap) as the most accurate characterization. A gap exists (H1 partially supported), no evidence of equivalent performance was found (H2 eliminated), and the gap's magnitude depends strongly on language, task, and prompting strategy (H3 supported).
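The tokenization mechanism in the chain above can be sketched numerically. Fertility is simply subword tokens per word; the linear penalty model, the slope default, the example token counts, and all function names below are illustrative assumptions, not SRC08's actual method — only the 8-18pp-per-extra-token range comes from the source.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokenization fertility: average subword tokens per word."""
    return num_tokens / num_words


def estimated_penalty_pp(fert: float, slope_pp: float = 8.0) -> float:
    """Estimated accuracy penalty in percentage points.

    Assumes a linear relationship of `slope_pp` points per extra
    token per word; SRC08 reports a range of 8-18pp, and the linear
    form here is an illustration, not the source's fitted model.
    """
    return max(0.0, (fert - 1.0) * slope_pp)


# Hypothetical counts: a high-resource language tokenizing near
# 1.2 tokens/word vs. a low-resource language near 2.5 tokens/word.
high_resource = fertility(num_tokens=12, num_words=10)  # 1.2
low_resource = fertility(num_tokens=25, num_words=10)   # 2.5

# Implied gap at the conservative (8pp) end of the SRC08 range.
gap_pp = estimated_penalty_pp(low_resource) - estimated_penalty_pp(high_resource)
```

Under these toy numbers the implied gap is on the order of 10pp at the conservative end of the range, consistent in magnitude with the benchmark gaps cited above.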
## Evidence Base Summary
| Source | Description | Reliability | Relevance | Key Finding |
|---|---|---|---|---|
| SRC01 | Vatsal et al. survey | Medium-High | High | 36 papers, 39 techniques, 250 languages |
| SRC02 | Mondshine et al. translation strategies | High | High | Selective pre-translation outperforms; 200%+ gains for low-resource |
| SRC03 | Huang et al. XLT | High | High | 10+ point improvement via cross-lingual prompting |
| SRC04 | BenchMAX benchmark | High | High | 17-language benchmark; scaling does not close gap |
| SRC05 | MMLU-ProX benchmark | High | High | 30-point English-Swahili gap; 13 languages |
| SRC06 | Kmainasi et al. Arabic | Medium-High | High | English prompts beat Arabic even on Arabic-centric models |
| SRC07 | Gupta et al. education | High | High | Per-language accuracy for Hindi, Mandarin, Arabic |
| SRC08 | Lundin et al. token tax | High | Medium-High | Tokenization as causal mechanism; 8-18pp per token/word |
## Collection Synthesis
| Dimension | Assessment |
|---|---|
| Evidence quality | Robust — 8 sources, 6 of which are High reliability, multiple benchmarks with large datasets |
| Source agreement | High — all sources agree that performance gaps exist; they differ only on magnitude and conditionality |
| Source independence | High — research teams from Bar-Ilan, Microsoft, ETH Zurich, Qatar University, multiple Asian institutions. No common upstream dependency |
| Outliers | SRC01-E02 (native prompts winning on some tasks) is not a true outlier but a nuance captured by H3 |
## Detail
The evidence base is unusually strong for an emerging field. Multiple independent research teams using different benchmarks, languages, and models converge on the same core finding: prompt engineering effectiveness varies across languages in a predictable, quantifiable way. The identification of a structural mechanism (tokenization bias) provides causal explanatory power beyond correlation.
## Gaps
| Missing Evidence | Impact on Assessment |
|---|---|
| Japanese-specific prompt comparison study | Japanese was named in Q001 but no Japanese-focused prompting study was found; performance data comes from broader benchmarks only |
| Longitudinal studies tracking gap changes over time | Cannot assess whether the gap is closing as models improve |
| Controlled studies isolating prompt language from task language | Most studies confound these; Mondshine et al. is the exception |
## Researcher Bias Check
Declared biases: No researcher profile was provided. As a general matter, the query presupposes that multilingual challenges exist (by asking "how does effectiveness vary" rather than "does effectiveness vary"). This framing could lead to confirmation bias.
Influence assessment: The risk was mitigated by including H2 (no gap) as a hypothesis and searching for evidence of equivalent performance. No such evidence was found, so the framing did not distort the outcome.
## Cross-References
| Entity | ID | File |
|---|---|---|
| Hypotheses | H1, H2, H3 | hypotheses/ |
| Sources | SRC01-SRC08 | sources/ |
| ACH Matrix | — | ach-matrix.md |
| Self-Audit | — | self-audit.md |