
R0023/2026-03-25/Q001/H2

Statement

Popular prompt engineering techniques are generally beneficial; counterproductive findings are edge cases limited to specific benchmarks, models, or task types that do not generalize.

Status

Current: Eliminated

The weight of evidence contradicts this hypothesis. Multiple independent studies with rigorous methodology demonstrate that counterproductive effects are systematic, not edge cases. The convergence of the Wharton series (4 reports, multiple models, multiple benchmarks) and the independent EMNLP study produces consistent findings: persona prompting degrades factual accuracy across models, chain-of-thought (CoT) prompting can hurt reasoning models, and emotional prompts have no reliable effect. These findings hold across different research groups, models, benchmarks, and methodologies.
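To make the tested manipulation concrete, here is a minimal sketch of a persona-versus-baseline comparison, assuming the OpenAI Python client; the model name, persona text, question items, and exact-match scoring are illustrative assumptions, not the protocol of the Wharton or EMNLP studies.

```python
# Minimal sketch of a persona-vs-baseline comparison (illustrative only).
# Assumptions not taken from the studies: the OpenAI Python client,
# model "gpt-4o", a toy question list, and naive exact-match scoring.
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    # Hypothetical items; the cited studies used benchmarks such as
    # GPQA Diamond and MMLU-Pro.
    {"prompt": "Answer with the symbol only. Chemical symbol for tungsten?",
     "answer": "W"},
]

PERSONA = "You are a world-renowned expert with decades of experience."


def ask(question: str, persona: str | None = None) -> str:
    """Query the model, optionally prepending a persona system message."""
    messages = []
    if persona is not None:
        messages.append({"role": "system", "content": persona})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content.strip()


def accuracy(persona: str | None = None) -> float:
    """Fraction of questions answered exactly correctly."""
    correct = sum(ask(q["prompt"], persona) == q["answer"] for q in QUESTIONS)
    return correct / len(QUESTIONS)


# The eliminated hypothesis predicts persona >= baseline; the cited
# studies repeatedly observed the persona condition scoring lower.
print("baseline accuracy:", accuracy())
print("persona accuracy: ", accuracy(PERSONA))
```

Holding everything fixed except the system message is the core of the comparison; the actual studies additionally sample each question repeatedly to estimate per-question variance.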

Supporting Evidence

SRC01-E01: The Prompt Report documents 58 techniques with best practices, implying they generally work

Contradicting Evidence

SRC03-E01: Expert personas show 9 significant negative effects across 5 of 6 models, not an edge case
SRC04-E01: Independent replication finds a 3.6% accuracy drop across 2,410 questions and 4 LLM families
SRC02-E01: CoT hurts reasoning models consistently, not just on one benchmark
SRC05-E01: 60-point per-question swings show that aggregate metrics mask effects that are not genuinely benign (see the sketch after this list)
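The masking claim is easy to see with a toy calculation. The numbers below are invented for illustration, a simplified binary version of SRC05's per-question probability swings: two conditions with identical aggregate accuracy can still flip the outcome of most individual questions.

```python
# Toy illustration with invented data: two conditions with identical
# aggregate accuracy can still flip the outcome of most questions.
baseline = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # 1 = correct on that question
treated  = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]  # same total, different questions

aggregate_delta = (sum(treated) - sum(baseline)) / len(baseline)
flips = sum(b != t for b, t in zip(baseline, treated))

print(f"aggregate accuracy change: {aggregate_delta:+.0%}")    # +0%, looks benign
print(f"per-question outcome flips: {flips}/{len(baseline)}")  # 6/10
```

An evaluation that reports only the aggregate delta would call this prompt manipulation harmless, even though it changed the model's behavior on most questions.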

Reasoning

H2 requires that counterproductive findings be isolated to narrow conditions. The evidence shows the opposite: counterproductive effects are reproducible across independent studies, multiple benchmarks (GPQA Diamond, MMLU-Pro), multiple model families (GPT-4o, Gemini, o-series), and multiple research groups (Wharton, Michigan/EMNLP). H2 is eliminated.

Relationship to Other Hypotheses

H2 represents the default assumption held by most prompt engineering practitioners. Its elimination is the most significant finding of this research: the assumption that "these techniques work" does not survive rigorous empirical testing.