
R0023/2026-03-25/Q001/SRC03

Wharton GAIL study: expert personas do not improve factual accuracy across 6 models and 2 benchmarks

Source

| Field | Value |
|---|---|
| Title | Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy |
| Publisher | Wharton Generative AI Labs / arXiv |
| Author(s) | Savir Basil, Ina Shapiro, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke |
| Date | 2025-12-07 |
| URL | https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/ |
| Type | Research paper (technical report) |

Summary

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | Rigorous methodology: 6 models (GPT-4o, GPT-4o-mini, o3-mini, o4-mini, Gemini 2.0 Flash, Gemini 2.5 Flash), 2 benchmarks (GPQA Diamond, 198 questions; MMLU-Pro, 300 questions), 12 prompting conditions, and 25 trials per question per condition. Temperature 1.0, zero-shot. |
| Relevance | Directly tests the most common piece of prompt-engineering advice: "You are an expert in X." Highest possible relevance to Q001. |
| Bias flags | Low risk. Multiple models and benchmarks; reports both positive and negative effects; identifies the one model-specific exception (Gemini 2.0 Flash). Not vendor-funded. |
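The trial structure described in the reliability rationale (prompting conditions crossed with benchmark questions, 25 sampled trials per question per condition, accuracy averaged per condition) can be sketched as below. This is a minimal illustration, not the study's actual harness: `ask_model`, the condition names, and the placeholder success rate are assumptions standing in for a real temperature-1.0, zero-shot model call.

```python
import random
from statistics import mean

# Condition list is illustrative; the study used 12 prompting conditions.
CONDITIONS = ["baseline", "domain expert", "layperson"]
TRIALS_PER_QUESTION = 25  # as in the study


def ask_model(condition: str, question: str, rng: random.Random) -> bool:
    """Stub for a real model call: returns True if the answer is correct.

    The 0.6 success rate is a placeholder, not data from the report.
    """
    return rng.random() < 0.6


def run_experiment(questions: list[str], rng: random.Random) -> dict[str, float]:
    """Mean per-question accuracy for each condition over repeated trials."""
    results = {}
    for condition in CONDITIONS:
        per_question_accuracy = []
        for question in questions:
            correct = sum(
                ask_model(condition, question, rng)
                for _ in range(TRIALS_PER_QUESTION)
            )
            per_question_accuracy.append(correct / TRIALS_PER_QUESTION)
        results[condition] = mean(per_question_accuracy)
    return results


rng = random.Random(0)  # fixed seed so the sketch is reproducible
accuracies = run_experiment(["q1", "q2"], rng)
print(accuracies)
```

With a real model behind `ask_model`, comparing the per-condition means (plus a significance test across conditions) is what yields claims like the 9 negative effects reported on MMLU-Pro.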

Evidence Extracts

| Evidence ID | Summary |
|---|---|
| SRC03-E01 | Expert personas provide no reliable improvement; 9 statistically significant negative effects were found on MMLU-Pro |
| SRC03-E02 | Low-knowledge personas (toddler, layperson) actively reduce accuracy in o4-mini and GPT-4o |
| SRC03-E03 | Domain-matched expert personas provide no meaningful benefit over baseline |