
Research R0027 — Multilingual prompt engineering challenges
Run 2026-03-26
Query Q001
Source SRC06
Evidence SRC06-E01
Type Statistical

English prompts outperform Arabic prompts even on Arabic-centric models

URL: https://arxiv.org/html/2409.07054v1

Extract

"Non-native prompt performs the best, followed by mixed and native prompts" across 197 experiments. Critically, even Jais-13b-chat, an Arabic-centric model, "showed best results with non-native prompts and struggled significantly with native Arabic instructions." GPT-4o showed the smallest gap between prompt languages. Few-shot learning improved performance notably compared to zero-shot approaches.

Relevance to Hypotheses

Hypothesis | Relationship | Rationale
H1 | Supports | Demonstrates clear performance variation between English and Arabic prompts
H2 | Contradicts | The finding is consistent across 3 models and 12 datasets
H3 | Supports | The gap varies by model: GPT-4o shows minimal difference while Jais-13b-chat struggles significantly

Context

The finding that even an Arabic-centric model performs better with English prompts is striking. It suggests that the advantage of English is structural, embedded in how models are trained, rather than a simple gap in language capability.