
R0023/2026-03-25/Q001/H2

Statement

Popular prompt engineering techniques are generally beneficial; counterproductive findings are edge cases limited to specific benchmarks, models, or task types that do not generalize.

Status

Current: Eliminated

The weight of evidence contradicts this hypothesis. Multiple independent studies with rigorous methodology demonstrate that counterproductive effects are systematic, not edge cases. The convergence of the Wharton series (4 reports, multiple models, multiple benchmarks) and the independent EMNLP study produces consistent findings: persona prompting degrades factual accuracy across models, chain-of-thought (CoT) prompting can hurt reasoning models, and emotional prompts have no reliable effect. These findings hold across different research groups, models, benchmarks, and methodologies.
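To make the tested manipulation concrete, here is a minimal sketch of a persona-versus-baseline comparison, assuming the OpenAI Python client; the model name, persona text, question items, and exact-match scoring are illustrative assumptions, not the protocol of the Wharton or EMNLP studies.

```python
# Minimal sketch of a persona-vs-baseline comparison (illustrative only).
# Assumptions not taken from the studies: the OpenAI Python client,
# model "gpt-4o", a toy question list, and naive exact-match scoring.
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    # Hypothetical items; the cited studies used benchmarks such as
    # GPQA Diamond and MMLU-Pro.
    {"prompt": "Answer with the symbol only. Chemical symbol for tungsten?",
     "answer": "W"},
]

PERSONA = "You are a world-renowned expert with decades of experience."


def ask(question: str, persona: str | None = None) -> str:
    """Query the model, optionally prepending a persona system message."""
    messages = []
    if persona is not None:
        messages.append({"role": "system", "content": persona})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content.strip()


def accuracy(persona: str | None = None) -> float:
    """Fraction of questions answered exactly correctly."""
    correct = sum(ask(q["prompt"], persona) == q["answer"] for q in QUESTIONS)
    return correct / len(QUESTIONS)


# The eliminated hypothesis predicts persona >= baseline; the cited
# studies repeatedly observed the persona condition scoring lower.
print("baseline accuracy:", accuracy())
print("persona accuracy: ", accuracy(PERSONA))
```

Holding everything fixed except the system message is the core of the comparison; the actual studies additionally sample each question repeatedly to estimate per-question variance.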

Supporting Evidence

SRC01-E01: The Prompt Report documents 58 techniques with best practices, implying they generally work

Contradicting Evidence

SRC03-E01: Expert personas show 9 significant negative effects across 5 of 6 models, not an edge case
SRC04-E01: Independent replication finds a 3.6% accuracy drop across 2,410 questions and 4 LLM families
SRC02-E01: CoT hurts reasoning models consistently, not just on one benchmark
SRC05-E01: 60-point per-question swings show that aggregate metrics mask effects that are not genuinely benign (see the sketch after this list)
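The masking claim is easy to see with a toy calculation. The numbers below are invented for illustration, a simplified binary version of SRC05's per-question probability swings: two conditions with identical aggregate accuracy can still flip the outcome of most individual questions.

```python
# Toy illustration with invented data: two conditions with identical
# aggregate accuracy can still flip the outcome of most questions.
baseline = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # 1 = correct on that question
treated  = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]  # same total, different questions

aggregate_delta = (sum(treated) - sum(baseline)) / len(baseline)
flips = sum(b != t for b, t in zip(baseline, treated))

print(f"aggregate accuracy change: {aggregate_delta:+.0%}")    # +0%, looks benign
print(f"per-question outcome flips: {flips}/{len(baseline)}")  # 6/10
```

An evaluation that reports only the aggregate delta would call this prompt manipulation harmless, even though it changed the model's behavior on most questions.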

Reasoning

H2 requires that counterproductive findings be isolated to narrow conditions. The evidence shows the opposite: counterproductive effects are reproducible across independent studies, multiple benchmarks (GPQA Diamond, MMLU-Pro), multiple model families (GPT-4o, Gemini, o-series), and multiple research groups (Wharton, Michigan/EMNLP). H2 is eliminated.

Relationship to Other Hypotheses

H2 represents the default assumption held by most prompt engineering practitioners. Its elimination is the most significant finding of this research: the assumption that "these techniques work" does not survive rigorous empirical testing.