R0023/2026-03-25/Q001/SRC03
Wharton GAIL study: expert personas do not improve factual accuracy across 6 models and 2 benchmarks
Source
| Field | Value |
| --- | --- |
| Title | Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy |
| Publisher | Wharton Generative AI Labs / arXiv |
| Author(s) | Savir Basil, Ina Shapiro, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke |
| Date | 2025-12-07 |
| URL | https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/ |
| Type | Research paper (technical report) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Rigorous methodology: 6 models (GPT-4o, GPT-4o-mini, o3-mini, o4-mini, Gemini 2.0 Flash, Gemini 2.5 Flash), 2 benchmarks (GPQA Diamond, 198 questions; MMLU-Pro, 300 questions), 12 prompting conditions, 25 trials per question per condition. Temperature 1.0, zero-shot. |
| Relevance | Directly tests the most common piece of prompt-engineering advice: "You are an expert in X." Highest possible relevance to Q001. |
| Bias flags | Low risk. Multiple models, multiple benchmarks; reports both positive and negative effects and identifies the one model-specific exception (Gemini 2.0 Flash). Not vendor-funded. |
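The trial design described in the Reliability rationale (each model × persona condition evaluated zero-shot, 25 trials per question) can be sketched as follows. This is a hypothetical illustration, not the study's actual harness: the persona strings, condition names, and the `ask_model` callable are placeholders, and only 4 of the 12 conditions are shown.

```python
# Hedged sketch of the per-cell accuracy measurement: for one model and one
# prompting condition, run every benchmark question 25 times and score it.
MODELS = ["gpt-4o", "gpt-4o-mini", "o3-mini", "o4-mini",
          "gemini-2.0-flash", "gemini-2.5-flash"]
# The study used 12 conditions; these 4 (and their wordings) are illustrative.
CONDITIONS = ["baseline", "domain_expert", "generic_expert", "toddler"]
TRIALS_PER_QUESTION = 25  # as reported in the study

def build_prompt(condition: str, question: str) -> str:
    """Prepend the persona instruction (if any) to a zero-shot question."""
    personas = {
        "baseline": "",
        "domain_expert": "You are an expert in this field. ",
        "generic_expert": "You are a world-class expert. ",
        "toddler": "You are a toddler. ",
    }
    return personas[condition] + question

def run_condition(ask_model, questions, condition,
                  trials=TRIALS_PER_QUESTION) -> float:
    """Accuracy for one (model, condition) cell: fraction of correct answers
    over all questions and all repeated trials.

    ask_model: callable taking a prompt string and returning an answer string.
    questions: iterable of (question_text, correct_answer) pairs.
    """
    correct = total = 0
    for question, answer in questions:
        for _ in range(trials):
            total += 1
            if ask_model(build_prompt(condition, question)) == answer:
                correct += 1
    return correct / total
```

Comparing `run_condition(..., "baseline")` against each persona condition per model is what yields the paper's per-cell effect estimates; with temperature 1.0, the 25 repeated trials capture sampling variance.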
| Evidence ID | Summary |
| --- | --- |
| SRC03-E01 | Expert personas provide no reliable improvement; 9 statistically significant negative effects found on MMLU-Pro |
| SRC03-E02 | Low-knowledge personas (toddler, layperson) actively reduce accuracy in o4-mini and GPT-4o |
| SRC03-E03 | Domain-matched expert personas provide no meaningful benefit over baseline |