R0023/2026-03-25/Q001/SRC03/E01¶
Expert personas provide no reliable improvement across models; nine statistically significant negative effects on MMLU-Pro.
URL: https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/
Extract¶
On GPQA Diamond: No expert persona consistently improved performance across models, and no significant positive differences were found between the baseline and domain-matched persona variations.
On MMLU-Pro: Five of six models showed no statistically significant improvement from expert personas, and nine statistically significant negative differences were observed. One exception: Gemini 2.0 Flash showed modest positive differences for five expert personas (e.g., Engineering Expert vs. baseline, RD = 0.089 [0.033, 0.148], p = 0.002), but this appears model-specific rather than generalizable.
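For readers unfamiliar with the RD notation above: the risk difference is simply the gap between two accuracy proportions. Below is a minimal sketch of how an RD and a Wald-style confidence interval are conventionally computed; the estimator choice and the item counts are illustrative assumptions, not taken from the study.

```python
import math

def risk_difference(k_persona: int, n_persona: int,
                    k_baseline: int, n_baseline: int,
                    z: float = 1.96) -> tuple[float, float, float]:
    """Return (RD, CI lower, CI upper) for two accuracy proportions,
    using a Wald interval on the difference of proportions."""
    p1 = k_persona / n_persona      # accuracy with the expert persona
    p0 = k_baseline / n_baseline    # accuracy with the baseline prompt
    rd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n_persona + p0 * (1 - p0) / n_baseline)
    return rd, rd - z * se, rd + z * se

# Hypothetical counts chosen only to reproduce an RD of 0.089:
print(risk_difference(620, 1000, 531, 1000))
```

A positive RD means the persona condition answered a larger share of items correctly; the nine negative differences reported above are RDs below zero whose intervals exclude it.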
The study directly challenges vendor recommendations: "Google's Vertex AI guide advises users to 'assign a role'... Anthropic's documentation includes templates like 'You are an expert AI tax analyst'... OpenAI's developer materials suggest prompts such as 'You are a world-class Python developer.'"
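To make the manipulation concrete: the vendor advice quoted above amounts to prepending a role statement to the system prompt. A minimal sketch using the OpenAI Python SDK follows; the model name, prompt wording, and helper function are illustrative assumptions (the persona text echoes the Anthropic template quoted above), not the study's exact materials.

```python
from openai import OpenAI

client = OpenAI()

BASELINE_SYSTEM = "Answer the following multiple-choice question."
PERSONA_SYSTEM = (
    "You are an expert AI tax analyst. "  # vendor-style persona template
    "Answer the following multiple-choice question."
)

def ask(system_prompt: str, question: str) -> str:
    """Send one question under a given system prompt and return the answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the study spans six models
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

The experimental contrast is then the accuracy of `ask(PERSONA_SYSTEM, q)` versus `ask(BASELINE_SYSTEM, q)` over a benchmark's question set.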
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Nine statistically significant negative effects with expert personas; this is not merely "no benefit" but active harm |
| H2 | Contradicts | The negative effects are systematic across five of six models, not edge cases |
| H3 | Supports | One model (Gemini 2.0 Flash) showed benefits, demonstrating context-dependence |
Context¶
This evidence is particularly significant because it directly contradicts the official documentation of three major AI providers (OpenAI, Anthropic, Google), all of which recommend persona/role prompting as a best practice.