R0023/2026-03-25/Q001/H3

Statement

Effectiveness is highly contingent on model, task, and context — no technique is universally helpful or harmful. The same technique can be beneficial or counterproductive depending on model architecture, task type, prompt structure, and other variables.

Status

Current: Supported

This is the hypothesis best supported by the evidence. Every study in the evidence base demonstrates context-dependence as the primary finding:

  • CoT helps non-reasoning models but hurts reasoning models (SRC02)
  • Persona prompting fails for factual accuracy but may help with tone/style (SRC03, SRC04)
  • Prompt tweaks produce 60-point swings on individual questions while averaging out in aggregate (SRC05)
  • The same technique shows opposite effects across models: Gemini 2.0 Flash benefits from personas, while the 5 other models tested do not (SRC03)
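
The SRC05 finding can look paradoxical: large swings on individual questions, yet little change in aggregate. A minimal sketch shows how the two coexist when swings run in both directions. The per-question numbers below are invented for illustration and are not data from SRC05; only the 60-point magnitude echoes the finding above.

```python
from statistics import mean

# HYPOTHETICAL per-question accuracy (percent correct over repeated trials)
# for a baseline prompt and a tweaked prompt. These values are made up
# for illustration; they are NOT taken from SRC05.
baseline = [20, 80, 50, 90, 40, 70]
tweaked = [80, 20, 55, 85, 70, 40]

# Change per question, in points (tweaked minus baseline).
deltas = [t - b for b, t in zip(baseline, tweaked)]

# Largest movement on any single question, regardless of direction.
largest_swing = max(abs(d) for d in deltas)

# Change in the benchmark-level average.
aggregate_change = mean(tweaked) - mean(baseline)

print("per-question deltas:", deltas)
print("largest swing (points):", largest_swing)
print("aggregate change (points):", aggregate_change)
```

In this toy case the gains and losses cancel, so an aggregate score alone would hide that the tweak changed the outcome on most questions.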

Supporting Evidence

Evidence    Summary
SRC02-E01   CoT: -3.3% in Gemini 2.5 Flash, +13.5% in Gemini 2.0 Flash, +11.7% in Sonnet 3.5; the effect differs sharply by model
SRC03-E01   Personas: negative in 5 models, positive in 1 model (Gemini 2.0 Flash); model-dependent
SRC03-E03   Even domain-matched personas provide no benefit; the intuitive "best case" fails
SRC04-E01   Mechanism: persona activation trades factual recall for instruction-following; a task-dependent tradeoff
SRC05-E01   60-point per-question swings demonstrate extreme per-instance variability
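
As a rough illustration of what "model-dependent" means for the SRC02-E01 figures, the sketch below checks whether the CoT effect changes sign across models and how wide the spread is. It uses only the three values listed above; the variable names and the sign-flip criterion are assumptions made for this example, not a method taken from the source.

```python
# CoT effect by model, in percent, as listed for SRC02-E01 above.
cot_effect = {
    "Gemini 2.5 Flash": -3.3,
    "Gemini 2.0 Flash": 13.5,
    "Sonnet 3.5": 11.7,
}

# Split models by the direction of the effect.
helped = [model for model, effect in cot_effect.items() if effect > 0]
hurt = [model for model, effect in cot_effect.items() if effect < 0]

# Distance between the best and worst outcome; rounded to avoid
# floating-point noise in the printed value.
spread = round(max(cot_effect.values()) - min(cot_effect.values()), 1)

# Assumed criterion for this sketch: the technique counts as contingent
# if it helps at least one model and hurts at least one other.
sign_flips = bool(helped) and bool(hurt)

print("helped:", helped)
print("hurt:", hurt)
print("spread:", spread)
print("sign flips across models:", sign_flips)
```

A sign flip is a stricter signal than a wide spread: it means the same advice ("use CoT") is beneficial for some models and counterproductive for others, which is the pattern H3 describes.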

Contradicting Evidence

No evidence directly contradicts H3. All evidence is consistent with context-dependence as the primary pattern.

Reasoning

H3 is the most parsimonious explanation of all the evidence. The Wharton Prompting Science Report 1 explicitly makes this its title finding: "Prompt Engineering is Complicated and Contingent." Every subsequent report in the series confirms this, and the independent EMNLP study also finds context-dependent effects. The practical implication is that one-size-fits-all prompt engineering advice is unreliable: none of the techniques in the evidence base has the same effect across models and tasks.

Relationship to Other Hypotheses

H3 subsumes H1 — the counterproductive effects documented in H1 are real, but they are instances of the broader pattern described by H3 rather than universal failures. H3 explains why H2 is wrong: popular advice treats techniques as universally beneficial, but the evidence shows they are conditionally beneficial at best.