R0023/2026-03-25/Q001/SRC02/E01¶
Chain-of-thought prompting decreases perfect accuracy in reasoning models.
URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532
Extract¶
For reasoning models with CoT prompting:
- Gemini Flash 2.5: average accuracy change of -3.3%; -13.1% at the 100% correctness threshold; -7.1% at the 90% threshold
- o3-mini: average accuracy change of +2.9%, but negligible change at strict thresholds
- o4-mini: average accuracy change of +3.1%, but negligible change at strict thresholds
For non-reasoning models, CoT improved Gemini Pro 1.5's average accuracy but decreased its perfect accuracy by 17.2%.
CoT increases response time by 35-600% (5-15 seconds) for non-reasoning models and by 20-80% (10-20 seconds) for reasoning models, in exchange for negligible or negative accuracy gains in reasoning models.
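The gap between average accuracy and strict-threshold ("perfect") accuracy is central to these numbers. A minimal illustrative sketch (not the paper's code, and with hypothetical per-run results) shows how a prompt change can raise average accuracy while lowering the fraction of tasks answered correctly on every run:

```python
def average_accuracy(tasks):
    """Mean of per-task accuracies across all tasks."""
    return sum(sum(runs) / len(runs) for runs in tasks) / len(tasks)

def threshold_accuracy(tasks, threshold):
    """Fraction of tasks whose per-task accuracy meets the threshold."""
    return sum(1 for runs in tasks if sum(runs) / len(runs) >= threshold) / len(tasks)

# Hypothetical per-task run outcomes (1 = correct run, 0 = incorrect run).
baseline = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0]]
with_cot = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]

print(average_accuracy(baseline))           # 0.75
print(threshold_accuracy(baseline, 1.0))    # 2 of 3 tasks perfect
print(average_accuracy(with_cot))           # higher on average...
print(threshold_accuracy(with_cot, 1.0))    # ...but only 1 of 3 tasks perfect
```

This mirrors the pattern reported for Gemini Pro 1.5: an average-accuracy metric can improve even as reliability at the 100% correctness threshold degrades.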
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates CoT can actively reduce accuracy in reasoning models, making it directly counterproductive |
| H2 | Contradicts | The data shows real negative effects, not just edge cases, across multiple models |
| H3 | Supports | The effect is model-dependent: CoT helps some non-reasoning models while hurting reasoning models |
Context¶
These findings are particularly significant because reasoning models already perform internal chain-of-thought processing. Adding explicit CoT prompting on top of that built-in reasoning creates redundancy that introduces noise rather than improving output.