R0023/2026-03-25/Q001/H2
Statement
Popular prompt engineering techniques are generally beneficial; counterproductive findings are edge cases limited to specific benchmarks, models, or task types that do not generalize.
Status
Current: Eliminated
The weight of evidence contradicts this hypothesis. Multiple independent studies with rigorous methodology demonstrate that counterproductive effects are systematic, not edge cases. The Wharton series (4 reports, multiple models, multiple benchmarks) and the independent EMNLP study converge on consistent findings: persona prompting degrades factual accuracy across models, chain-of-thought (CoT) prompting can hurt reasoning models, and emotional prompts have no reliable effect. These findings span different research groups, models, benchmarks, and methodologies.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | The Prompt Report documents 58 techniques with best practices, implying they generally work |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC03-E01 | Expert personas show 9 significant negative effects across 5 of 6 models, which rules out an edge case |
| SRC04-E01 | Independent replication: 3.6% accuracy drop across 2,410 questions and 4 LLM families |
| SRC02-E01 | CoT hurts reasoning models consistently, not just on one benchmark |
| SRC05-E01 | 60-point per-question swings show that aggregate metrics mask the effects; a flat average does not mean the technique is benign |
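
The masking effect described in SRC05-E01 can be illustrated with a small sketch. The numbers below are synthetic and chosen only for illustration; they are not taken from any of the cited studies, and the variable names are hypothetical. The point is that large per-question gains and losses can cancel out, leaving an aggregate score that looks unchanged.

```python
# Synthetic per-question accuracy (percent correct over repeated trials).
# These values are invented for illustration, not drawn from SRC05.
baseline = [90, 20, 70, 40, 80]
with_technique = [30, 80, 70, 100, 20]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Aggregate view: difference between the two overall averages.
aggregate_delta = mean(with_technique) - mean(baseline)

# Per-question view: signed change for each individual question.
per_question_delta = [w - b for b, w in zip(baseline, with_technique)]

# Largest absolute change on any single question.
largest_swing = max(abs(d) for d in per_question_delta)

print(f"aggregate change: {aggregate_delta:+.1f} points")
print(f"per-question changes: {per_question_delta}")
print(f"largest single-question swing: {largest_swing} points")
```

In this toy data the aggregate change is zero while individual questions move by as much as 60 points in either direction, which is why an unchanged average should not be read as evidence that a technique is harmless.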
Reasoning
H2 requires that counterproductive findings be isolated to narrow conditions. The evidence shows the opposite: counterproductive effects are reproducible across independent studies, multiple benchmarks (GPQA Diamond, MMLU-Pro), multiple model families (GPT-4o, Gemini, o-series), and multiple research groups (Wharton, Michigan/EMNLP). H2 is eliminated.
Relationship to Other Hypotheses
H2 represents the default assumption held by most prompt engineering practitioners. Its elimination is the most significant finding of this research: the assumption that "these techniques work" does not hold up under rigorous empirical testing.