R0023/2026-03-25/Q001/SRC02
Wharton GAIL study on the decreasing value of chain-of-thought prompting
Source
| Field | Value |
| --- | --- |
| Title | Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting |
| Publisher | SSRN / Wharton Generative AI Labs |
| Author(s) | Lennart Meincke, Ethan R. Mollick, Lilach Mollick, Dan Shapiro |
| Date | 2025-06-08 |
| URL | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532 |
| Type | Research paper (technical report) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A — not an RCT |
| Bias: Protocol deviation | N/A — not an RCT |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Rigorous methodology: GPQA Diamond (198 PhD-level questions), 25 trials per condition, 8 models tested across reasoning and non-reasoning categories. Multiple correctness thresholds (100%, 90%, 51%, average). Wharton institutional affiliation. |
| Relevance | Directly addresses whether chain-of-thought — arguably the most widely recommended prompt technique — can be counterproductive. High relevance to Q001. |
| Bias flags | Low risk. Uses established benchmarks, tests both positive and negative outcomes, reports all results including where CoT helps. Not funded by any AI vendor. |
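The multiple-threshold scoring noted under Reliability can be sketched in a few lines of Python. This is an illustration only, not the report's actual code: the `threshold_accuracy` helper and the `results` data are hypothetical, assuming each question is attempted 25 times per condition and a question counts as "correct at threshold t" when it is answered correctly in at least that fraction of trials.

```python
from statistics import mean

def threshold_accuracy(trial_results, threshold):
    """Fraction of questions answered correctly in at least
    `threshold` of their trials (1.0 = correct in every trial)."""
    return mean(
        1.0 if mean(trials) >= threshold else 0.0
        for trials in trial_results.values()
    )

# Hypothetical per-question correctness over 25 trials each.
results = {
    "q1": [True] * 25,                 # correct in every trial
    "q2": [True] * 23 + [False] * 2,   # correct in 92% of trials
    "q3": [True] * 13 + [False] * 12,  # correct in 52% of trials
}

for label, t in [("100%", 1.0), ("90%", 0.9), ("51%", 0.51)]:
    print(label, round(threshold_accuracy(results, t), 3))
```

Under this scheme the stricter thresholds penalize run-to-run variability: a question answered correctly in 23 of 25 trials counts at the 90% and 51% thresholds but not at 100%, which is why added variance from CoT can lower perfect-accuracy scores even when average accuracy barely moves.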
Evidence
| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | CoT decreases perfect accuracy in reasoning models (Gemini Flash 2.5: -13.1% at 100% threshold) |
| SRC02-E02 | CoT introduces variability causing errors on easy questions the model would otherwise answer correctly |