R0023/2026-03-25/Q001/SRC03/E01

Research R0023 — Counterproductive advice and prompt lifecycle
Run 2026-03-25
Query Q001
Source SRC03
Evidence SRC03-E01
Type Statistical

Expert personas provide no reliable improvement across models; 9 statistically significant negative effects on MMLU-Pro.

URL: https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/

Extract

On GPQA Diamond: No expert persona consistently improved performance across models, and no statistically significant positive differences were observed between the baseline and domain-matched persona variations.

On MMLU-Pro: Five of six models showed no statistically significant improvement from expert personas, and nine statistically significant negative differences were observed. One exception: Gemini 2.0 Flash showed modest positive differences for five expert personas (e.g., Engineering Expert vs. baseline RD = 0.089 [0.033, 0.148], p = 0.002), but this appears model-specific rather than generalizable.
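The RD values above are risk differences (difference in accuracy proportions between persona and baseline conditions). As an illustration only, a minimal sketch of a two-proportion risk difference with a Wald-style 95% interval follows; the counts below are hypothetical and the study's exact estimation method is not specified in this extract:

```python
from math import sqrt

def risk_difference(successes_a, n_a, successes_b, n_b, z=1.96):
    """Risk difference between two proportions with a Wald 95% CI."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    rd = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return rd, (rd - z * se, rd + z * se)

# Hypothetical counts chosen for illustration (not the study's data):
# 545/1000 correct with persona vs. 456/1000 at baseline.
rd, (lo, hi) = risk_difference(545, 1000, 456, 1000)
print(f"RD = {rd:.3f} [{lo:.3f}, {hi:.3f}]")  # → RD = 0.089 [0.045, 0.133]
```

A negative RD under this convention means the persona condition scored below baseline, which is the direction of the nine significant effects reported above.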

The study directly challenges vendor recommendations: "Google's Vertex AI guide advises users to 'assign a role'... Anthropic's documentation includes templates like 'You are an expert AI tax analyst'... OpenAI's developer materials suggest prompts such as 'You are a world-class Python developer.'"

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | 9 statistically significant negative effects with expert personas; this is not merely "no benefit" but active harm
H2 | Contradicts | The negative effects are systematic across 5 of 6 models, not edge cases
H3 | Supports | One model (Gemini 2.0 Flash) showed benefits, demonstrating context-dependence

Context

This evidence is particularly significant because it directly contradicts the official documentation of the three largest AI providers (OpenAI, Anthropic, Google), all of which recommend persona/role prompting as a best practice.