R0023/2026-03-25/Q001/H3
Statement
The effectiveness of a prompting technique is highly contingent on model, task, and context: no technique is universally helpful or harmful. The same technique can be beneficial or counterproductive depending on model architecture, task type, prompt structure, and other variables.
Status
Current: Supported
This is the hypothesis best supported by the evidence. Every study in the evidence base demonstrates context-dependence as the primary finding:
- Chain-of-thought (CoT) prompting helps non-reasoning models but hurts reasoning models (SRC02)
- Persona prompting fails for factual accuracy but may help with tone/style (SRC03, SRC04)
- Prompt tweaks produce 60-point swings on individual questions while averaging out in aggregate (SRC05)
- The same technique shows opposite effects across models: Gemini 2.0 Flash benefits from personas, while 5 other models do not (SRC03)
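
The SRC05 pattern in the list above (large per-question swings that cancel in aggregate) can be illustrated with a small calculation. This is a minimal sketch using hypothetical per-question scores chosen for illustration; the numbers and variable names are assumptions and are not data from SRC05.

```python
from statistics import mean

# HYPOTHETICAL per-question accuracy (percent correct over repeated trials)
# for a baseline prompt and a tweaked prompt. Illustrative only; not SRC05 data.
baseline = [90, 20, 55, 70, 40]
tweaked = [30, 80, 55, 65, 45]

# Change on each individual question, in percentage points.
per_question_delta = [t - b for b, t in zip(baseline, tweaked)]

# Largest swing on any single question, regardless of direction.
largest_swing = max(abs(d) for d in per_question_delta)

# Change in the benchmark-level average.
aggregate_delta = mean(tweaked) - mean(baseline)

print("per-question deltas:", per_question_delta)
print("largest swing:", largest_swing)
print("aggregate delta:", aggregate_delta)
```

In this toy case one question drops by 60 points and another rises by 60, so the aggregate score is unchanged even though individual answers are far less stable than the average suggests.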
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E01 | CoT: -3.3% in Gemini 2.5 Flash, +13.5% in Gemini 2.0 Flash, +11.7% in Sonnet 3.5; the direction and size of the effect differ by model |
| SRC03-E01 | Personas: negative in 5 models, positive in 1 model (Gemini 2.0 Flash) — model-dependent |
| SRC03-E03 | Even domain-matched personas provide no benefit — the intuitive "best case" fails |
| SRC05-E01 | 60-point per-question swings demonstrate extreme per-instance variability |
| SRC04-E01 | Mechanism: persona activation trades factual recall for instruction-following — task-dependent tradeoff |
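
The model-dependence in the SRC02-E01 row can be checked mechanically. The following is a minimal sketch that uses only the three CoT deltas quoted in that row; the dictionary layout and the helper name `has_mixed_sign` are illustrative assumptions, not part of the source studies.

```python
# CoT deltas as quoted in SRC02-E01 (units as reported in that row).
cot_delta_by_model = {
    "Gemini 2.5 Flash": -3.3,
    "Gemini 2.0 Flash": 13.5,
    "Sonnet 3.5": 11.7,
}


def has_mixed_sign(deltas):
    """Return True when at least one delta is positive and at least one is negative."""
    values = list(deltas.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)


# Split the models by the direction of the reported effect.
helped = sorted(m for m, d in cot_delta_by_model.items() if d > 0)
hurt = sorted(m for m, d in cot_delta_by_model.items() if d < 0)

print("mixed sign:", has_mixed_sign(cot_delta_by_model))
print("helped:", helped)
print("hurt:", hurt)
```

A mixed-sign result for a single technique is the simplest signal that its effect cannot be summarized as uniformly helpful or uniformly harmful.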
Contradicting Evidence
No evidence directly contradicts H3. All evidence is consistent with context-dependence as the primary pattern.
Reasoning
H3 is the most parsimonious explanation of all the evidence. The Wharton Prompting Science Report 1 explicitly makes this its title finding: "Prompt Engineering is Complicated and Contingent." Every subsequent report in the series confirms this, and the independent EMNLP study also finds context-dependent effects. The key insight is that universal prompt engineering advice is unreliable, because no technique in the evidence base has the same effect across models and tasks.
Relationship to Other Hypotheses
H3 subsumes H1 — the counterproductive effects documented in H1 are real, but they are instances of the broader pattern described by H3 rather than universal failures. H3 explains why H2 is wrong: popular advice treats techniques as universally beneficial, but the evidence shows they are conditionally beneficial at best.