
R0023/2026-03-25/Q001/SRC02

Wharton GAIL study on the decreasing value of chain-of-thought prompting

Source

Title: Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting
Publisher: SSRN / Wharton Generative AI Labs
Author(s): Lennart Meincke, Ethan R. Mollick, Lilach Mollick, Dan Shapiro
Date: 2025-06-08
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532
Type: Research paper (technical report)

Summary

Reliability: High
Relevance: High
Bias (missing data): Low risk
Bias (measurement): Low risk
Bias (selective reporting): Low risk
Bias (randomization): N/A — not an RCT
Bias (protocol deviation): N/A — not an RCT
Bias (COI/funding): Low risk

Rationale

Reliability: Rigorous methodology: GPQA Diamond (198 PhD-level questions), 25 trials per condition, and 8 models tested across reasoning and non-reasoning categories. Results are reported at multiple correctness thresholds (100%, 90%, 51%, and average). Wharton institutional affiliation.
Relevance: Directly addresses whether chain-of-thought, arguably the most widely recommended prompting technique, can be counterproductive. Highly relevant to Q001.
Bias flags: Low risk. Uses established benchmarks, tests both positive and negative outcomes, and reports all results, including cases where CoT helps. Not funded by any AI vendor.
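To make the threshold-based scoring concrete, here is a minimal sketch of how per-question accuracy over repeated trials maps to the report's correctness thresholds. The function names and the toy data are hypothetical (the paper runs 25 trials per question; the example below uses 5 for brevity); only the scoring idea is taken from the summary above.

```python
# Hypothetical sketch of multi-threshold scoring: a question counts as
# "correct at threshold t" if it is answered correctly in at least a
# fraction t of its repeated trials.

def threshold_accuracy(trials_per_question, threshold):
    """Fraction of questions answered correctly in at least
    `threshold` of their trials (each trial is a 0/1 outcome)."""
    correct = sum(
        1 for trials in trials_per_question
        if sum(trials) / len(trials) >= threshold
    )
    return correct / len(trials_per_question)

def average_accuracy(trials_per_question):
    """Mean per-trial accuracy, pooled over all questions."""
    total = sum(sum(t) for t in trials_per_question)
    n = sum(len(t) for t in trials_per_question)
    return total / n

# Toy data: 3 questions x 5 trials each (the paper uses 25 trials).
results = [
    [1, 1, 1, 1, 1],  # always right
    [1, 1, 1, 1, 0],  # one slip -- fails the 100% threshold
    [1, 0, 1, 0, 0],  # unstable
]
print(threshold_accuracy(results, 1.00))  # only 1 of 3 questions is perfect
print(threshold_accuracy(results, 0.51))  # 2 of 3 are right in a majority of trials
print(average_accuracy(results))          # pooled per-trial accuracy
```

This illustrates the mechanism behind the 100%-threshold finding: added run-to-run variability can leave average accuracy roughly unchanged while sharply reducing the share of questions a model gets right every time.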

Evidence Extracts

SRC02-E01: CoT decreases perfect accuracy in reasoning models (Gemini Flash 2.5: -13.1% at the 100% threshold).
SRC02-E02: CoT introduces run-to-run variability, causing errors on easy questions the model would otherwise answer correctly.