R0023/2026-03-25/Q001/H3


Research	R0023 — Counterproductive advice and prompt lifecycle
Run	2026-03-25
Query	Q001
Hypothesis	H3

Statement

The effectiveness of a prompting technique is highly contingent on model, task, and context; no technique is universally helpful or harmful. The same technique can be beneficial or counterproductive depending on model architecture, task type, prompt structure, and other variables.

Status

Current: Supported

This is the hypothesis best supported by the evidence. Every study in the evidence base demonstrates context-dependence as the primary finding:

Chain-of-thought (CoT) prompting helps non-reasoning models but hurts reasoning models (SRC02)
Persona prompting fails for factual accuracy but may help with tone/style (SRC03, SRC04)
Prompt tweaks produce 60-point swings on individual questions while averaging out in aggregate (SRC05)
The same technique shows opposite effects across models: Gemini 2.0 Flash benefits from personas, while 5 other models do not (SRC03)
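
The point about large per-question swings averaging out in aggregate can seem counterintuitive, so here is a minimal arithmetic sketch of how it can happen. The numbers are entirely hypothetical and are not data from SRC05; the variable names are likewise illustrative.

```python
# HYPOTHETICAL per-question accuracy (%) under a baseline prompt and a tweaked prompt.
# These values are invented for illustration; they are not taken from SRC05.
baseline = [80, 20, 50, 70, 30]
tweaked = [20, 80, 50, 40, 60]

# Per-question change caused by the tweak.
diffs = [t - b for b, t in zip(baseline, tweaked)]

# Largest swing on any single question, regardless of direction.
largest_swing = max(abs(d) for d in diffs)

# Average change across all questions.
mean_change = sum(diffs) / len(diffs)

print(f"per-question changes: {diffs}")
print(f"largest swing: {largest_swing} points, mean change: {mean_change} points")
```

In this toy case, gains on some questions cancel losses on others, so an aggregate score alone would hide the per-question variability.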

Supporting Evidence

Evidence	Summary
SRC02-E01	CoT: -3.3% in Gemini Flash 2.5, +13.5% in Gemini Flash 2.0, +11.7% in Sonnet 3.5 — radically different effects by model
SRC03-E01	Personas: negative in 5 models, positive in 1 model (Gemini 2.0 Flash) — model-dependent
SRC03-E03	Even domain-matched personas provide no benefit — the intuitive "best case" fails
SRC05-E01	60-point per-question swings demonstrate extreme per-instance variability
SRC04-E01	Mechanism: persona activation trades factual recall for instruction-following — task-dependent tradeoff
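
As a rough illustration of the model-dependence in SRC02-E01, the sketch below takes the three CoT figures listed in the table and checks whether they share a sign. The helper names `has_mixed_sign` and `spread` and the dictionary layout are assumptions made for this example, not part of the source studies.

```python
# CoT effects (%) exactly as listed for SRC02-E01 in the table above.
cot_delta_by_model = {
    "Gemini Flash 2.5": -3.3,
    "Gemini Flash 2.0": 13.5,
    "Sonnet 3.5": 11.7,
}

def has_mixed_sign(deltas):
    """Return True if at least one effect is positive and at least one is negative."""
    values = list(deltas.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)

def spread(deltas):
    """Return the gap between the largest and smallest effect."""
    values = list(deltas.values())
    return max(values) - min(values)

mixed = has_mixed_sign(cot_delta_by_model)
gap = round(spread(cot_delta_by_model), 1)
print(f"mixed sign: {mixed}, spread: {gap}")
```

A mixed-sign result is the pattern H3 predicts: a single technique that helps some models and hurts others.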

Contradicting Evidence

No evidence directly contradicts H3. All evidence is consistent with context-dependence as the primary pattern.

Reasoning

H3 is the most parsimonious explanation of the evidence. The first Wharton Prompting Science Report states the finding in its title: "Prompt Engineering is Complicated and Contingent." Each subsequent report in the series confirms it, and the independent EMNLP study also finds context-dependent effects. The key implication is that one-size-fits-all prompt engineering advice is inherently flawed, because no technique has the same effect across models, tasks, and contexts.

Relationship to Other Hypotheses

H3 subsumes H1 — the counterproductive effects documented in H1 are real, but they are instances of the broader pattern described by H3 rather than universal failures. H3 explains why H2 is wrong: popular advice treats techniques as universally beneficial, but the evidence shows they are conditionally beneficial at best.