
R0023/2026-03-25/Q001/SRC03

Wharton GAIL study: expert personas do not improve factual accuracy across 6 models and 2 benchmarks

Source

| Field | Value |
|---|---|
| Title | Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy |
| Publisher | Wharton Generative AI Labs / arXiv |
| Author(s) | Savir Basil, Ina Shapiro, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke |
| Date | 2025-12-07 |
| URL | https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/ |
| Type | Research paper (technical report) |

Summary

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | Rigorous methodology: 6 models (GPT-4o, GPT-4o-mini, o3-mini, o4-mini, Gemini 2.0 Flash, Gemini 2.5 Flash), 2 benchmarks (GPQA Diamond, 198 questions; MMLU-Pro, 300 questions), 12 prompting conditions, and 25 trials per question per condition. Temperature 1.0, zero-shot. |
| Relevance | Directly tests the most common piece of prompt-engineering advice: "You are an expert in X." Highest possible relevance to Q001. |
| Bias flags | Low risk. Multiple models and benchmarks; reports both positive and negative effects; identifies the one model-specific exception (Gemini 2.0 Flash). Not vendor-funded. |
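The trial structure described in the reliability rationale (prompting conditions crossed with benchmark questions, 25 sampled trials per question per condition, accuracy averaged per condition) can be sketched as below. This is a minimal illustration, not the study's actual harness: `ask_model`, the condition names, and the placeholder success rate are assumptions standing in for a real temperature-1.0, zero-shot model call.

```python
import random
from statistics import mean

# Condition list is illustrative; the study used 12 prompting conditions.
CONDITIONS = ["baseline", "domain expert", "layperson"]
TRIALS_PER_QUESTION = 25  # as in the study


def ask_model(condition: str, question: str, rng: random.Random) -> bool:
    """Stub for a real model call: returns True if the answer is correct.

    The 0.6 success rate is a placeholder, not data from the report.
    """
    return rng.random() < 0.6


def run_experiment(questions: list[str], rng: random.Random) -> dict[str, float]:
    """Mean per-question accuracy for each condition over repeated trials."""
    results = {}
    for condition in CONDITIONS:
        per_question_accuracy = []
        for question in questions:
            correct = sum(
                ask_model(condition, question, rng)
                for _ in range(TRIALS_PER_QUESTION)
            )
            per_question_accuracy.append(correct / TRIALS_PER_QUESTION)
        results[condition] = mean(per_question_accuracy)
    return results


rng = random.Random(0)  # fixed seed so the sketch is reproducible
accuracies = run_experiment(["q1", "q2"], rng)
print(accuracies)
```

With a real model behind `ask_model`, comparing the per-condition means (plus a significance test across conditions) is what yields claims like the 9 negative effects reported on MMLU-Pro.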

Evidence Extracts

| Evidence ID | Summary |
|---|---|
| SRC03-E01 | Expert personas provide no reliable improvement; 9 statistically significant negative effects were found on MMLU-Pro |
| SRC03-E02 | Low-knowledge personas (toddler, layperson) actively reduce accuracy in o4-mini and GPT-4o |
| SRC03-E03 | Domain-matched expert personas provide no meaningful benefit over baseline |