R0023/2026-03-25/Q001/SRC03/E01¶
Expert personas provide no reliable improvement across models; nine statistically significant negative effects on MMLU-Pro.
URL: https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/
Extract¶
On GPQA Diamond: No expert persona consistently improved performance across models, and no significant positive differences were found between the baseline and domain-matched persona variations.
On MMLU-Pro: Five of six models showed no statistically significant improvement from expert personas, and nine statistically significant negative differences were observed. One exception: Gemini 2.0 Flash showed modest positive differences for five expert personas (e.g., Engineering Expert vs. baseline, RD = 0.089 [0.033, 0.148], p = 0.002), but this appears model-specific rather than generalizable.
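For readers unfamiliar with the RD notation above: the risk difference is simply the gap between two accuracy proportions. Below is a minimal sketch of how an RD and a Wald-style confidence interval are conventionally computed; the estimator choice and the item counts are illustrative assumptions, not taken from the study.

```python
import math

def risk_difference(k_persona: int, n_persona: int,
                    k_baseline: int, n_baseline: int,
                    z: float = 1.96) -> tuple[float, float, float]:
    """Return (RD, CI lower, CI upper) for two accuracy proportions,
    using a Wald interval on the difference of proportions."""
    p1 = k_persona / n_persona      # accuracy with the expert persona
    p0 = k_baseline / n_baseline    # accuracy with the baseline prompt
    rd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n_persona + p0 * (1 - p0) / n_baseline)
    return rd, rd - z * se, rd + z * se

# Hypothetical counts chosen only to reproduce an RD of 0.089:
print(risk_difference(620, 1000, 531, 1000))
```

A positive RD means the persona condition answered a larger share of items correctly; the nine negative differences reported above are RDs below zero whose intervals exclude it.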
The study directly challenges vendor recommendations: "Google's Vertex AI guide advises users to 'assign a role'... Anthropic's documentation includes templates like 'You are an expert AI tax analyst'... OpenAI's developer materials suggest prompts such as 'You are a world-class Python developer.'"
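To make the manipulation concrete: the vendor advice quoted above amounts to prepending a role statement to the system prompt. A minimal sketch using the OpenAI Python SDK follows; the model name, prompt wording, and helper function are illustrative assumptions (the persona text echoes the Anthropic template quoted above), not the study's exact materials.

```python
from openai import OpenAI

client = OpenAI()

BASELINE_SYSTEM = "Answer the following multiple-choice question."
PERSONA_SYSTEM = (
    "You are an expert AI tax analyst. "  # vendor-style persona template
    "Answer the following multiple-choice question."
)

def ask(system_prompt: str, question: str) -> str:
    """Send one question under a given system prompt and return the answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the study spans six models
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

The experimental contrast is then the accuracy of `ask(PERSONA_SYSTEM, q)` versus `ask(BASELINE_SYSTEM, q)` over a benchmark's question set.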
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Nine statistically significant negative effects with expert personas; this is not merely "no benefit" but active harm |
| H2 | Contradicts | The negative effects are systematic across five of six models, not edge cases |
| H3 | Supports | One model (Gemini 2.0 Flash) showed benefits, demonstrating context-dependence |
Context¶
This evidence is particularly significant because it directly contradicts the official documentation of three major AI providers (OpenAI, Anthropic, Google), all of which recommend persona/role prompting as a best practice.