R0023/2026-03-25/Q001/SRC02/E01

Research R0023 — Counterproductive advice and prompt lifecycle
Run 2026-03-25
Query Q001
Source SRC02
Evidence SRC02-E01
Type Statistical

Chain-of-thought prompting decreases perfect accuracy in reasoning models.

URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532

Extract

For reasoning models with CoT prompting:

  • Gemini Flash 2.5: average accuracy -3.3%; -13.1% at the 100%-correctness threshold; -7.1% at the 90% threshold
  • o3-mini: average accuracy +2.9%, but negligible at strict thresholds
  • o4-mini: average accuracy +3.1%, but negligible at strict thresholds

For non-reasoning models, CoT improved Gemini Pro 1.5's average accuracy but decreased its perfect (100%-threshold) accuracy by 17.2%.

CoT increases response time by 35-600% (5-15 seconds) for non-reasoning models and by 20-80% (10-20 seconds) for reasoning models, in exchange for negligible or negative accuracy gains in reasoning models.
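The relative and absolute overhead figures above can be cross-checked with simple arithmetic: if a relative increase and the seconds it adds refer to the same response, they jointly imply a baseline response time. This is an illustrative sketch, assuming the stated ranges describe the same requests; the pairing of low-with-low and high-with-high endpoints is our assumption, not the paper's.

```python
def implied_baseline(added_seconds: float, relative_increase: float) -> float:
    """Baseline latency (s) such that baseline * relative_increase == added_seconds."""
    return added_seconds / relative_increase

# Non-reasoning models: +35%..600% overhead, stated as +5..15 seconds.
# Pairing endpoints (an assumption) gives the implied baseline latencies:
low_end = implied_baseline(5.0, 0.35)    # +35% adding 5 s implies ~14.3 s baseline
high_end = implied_baseline(15.0, 6.00)  # +600% adding 15 s implies ~2.5 s baseline
print(round(low_end, 1), round(high_end, 1))
```

The spread (roughly 2.5 s to 14 s baselines) is consistent with the wide percentage range reflecting very different underlying response times across models, which is worth keeping in mind when weighing the overhead against the negligible accuracy gains reported for reasoning models.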

Relevance to Hypotheses

  • H1 (Supports): demonstrates that CoT can actively reduce accuracy in reasoning models, making it directly counterproductive
  • H2 (Contradicts): the data show real negative effects across multiple models, not just edge cases
  • H3 (Supports): the effect is model-dependent; CoT helps some non-reasoning models while hurting reasoning models

Context

These findings are particularly significant because many modern LLMs (reasoning models) already perform internal chain-of-thought processing. Adding explicit CoT prompting on top of built-in reasoning creates redundancy that introduces noise rather than improving output.