R0023/2026-03-25/Q001/SRC02/E01¶
Chain-of-thought prompting decreases perfect accuracy in reasoning models.
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532
Extract¶
For reasoning models with CoT prompting:
- Gemini Flash 2.5: average accuracy change of -3.3%; -13.1% at the 100% correctness threshold; -7.1% at the 90% threshold
- o3-mini: average accuracy change of +2.9%, but negligible change at strict thresholds
- o4-mini: average accuracy change of +3.1%, but negligible change at strict thresholds
For non-reasoning models, CoT improved Gemini Pro 1.5's average accuracy but decreased its perfect accuracy by 17.2%.
CoT increases response time by 35-600% (5-15 seconds) for non-reasoning models and by 20-80% (10-20 seconds) for reasoning models, in exchange for negligible or negative accuracy gains in reasoning models.
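The gap between average accuracy and strict-threshold ("perfect") accuracy is central to these numbers. A minimal illustrative sketch (not the paper's code, and with hypothetical per-run results) shows how a prompt change can raise average accuracy while lowering the fraction of tasks answered correctly on every run:

```python
def average_accuracy(tasks):
    """Mean of per-task accuracies across all tasks."""
    return sum(sum(runs) / len(runs) for runs in tasks) / len(tasks)

def threshold_accuracy(tasks, threshold):
    """Fraction of tasks whose per-task accuracy meets the threshold."""
    return sum(1 for runs in tasks if sum(runs) / len(runs) >= threshold) / len(tasks)

# Hypothetical per-task run outcomes (1 = correct run, 0 = incorrect run).
baseline = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0]]
with_cot = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]

print(average_accuracy(baseline))           # 0.75
print(threshold_accuracy(baseline, 1.0))    # 2 of 3 tasks perfect
print(average_accuracy(with_cot))           # higher on average...
print(threshold_accuracy(with_cot, 1.0))    # ...but only 1 of 3 tasks perfect
```

This mirrors the pattern reported for Gemini Pro 1.5: an average-accuracy metric can improve even as reliability at the 100% correctness threshold degrades.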
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates CoT can actively reduce accuracy in reasoning models, making it directly counterproductive |
| H2 | Contradicts | The data shows real negative effects, not just edge cases, across multiple models |
| H3 | Supports | The effect is model-dependent: CoT helps some non-reasoning models while hurting reasoning models |
Context¶
These findings are particularly significant because reasoning models already perform internal chain-of-thought processing. Adding explicit CoT prompting on top of that built-in reasoning creates redundancy that introduces noise rather than improving output.