R0023/2026-03-25/Q001/SRC05/E01
Prompt tweaks produce accuracy swings of up to 60 percentage points on individual questions; these swings average out in aggregate, masking per-question variability.
URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/
## Extract
At the strictest threshold (100% accuracy across 100 repetitions), GPT-4o performed barely better than random guessing (30.28%), while at lower thresholds it reached 47.54%. Saying "Please" versus "I order you" produced performance swings up to 60 percentage points on individual questions, yet these differences "balance out across the full dataset."
This means aggregate benchmarks mask critical per-question variability. A prompt technique that appears neutral in aggregate may simultaneously be helping some questions and harming others. Single-attempt testing obscures reliability issues that matter for high-stakes deployment.
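The masking effect can be illustrated with a small synthetic sketch. The per-question pass rates below are hypothetical, not the Wharton data: two prompt variants have identical aggregate accuracy even though individual questions swing by 60 points in opposite directions.

```python
# Hypothetical per-question pass rates (fraction correct over 100
# repetitions) for two prompt variants on five questions.
# Numbers are illustrative only, not the Wharton results.
polite  = [0.95, 0.20, 0.70, 0.90, 0.25]  # "Please ..."
command = [0.35, 0.80, 0.70, 0.30, 0.85]  # "I order you ..."

aggregate_polite = sum(polite) / len(polite)
aggregate_command = sum(command) / len(command)
per_question_swings = [abs(p - c) for p, c in zip(polite, command)]

print(f"aggregate (polite):     {aggregate_polite:.2f}")   # 0.60
print(f"aggregate (command):    {aggregate_command:.2f}")  # 0.60
print(f"max per-question swing: {max(per_question_swings):.2f}")  # 0.60
```

Both variants score 60% in aggregate, so a benchmark-level comparison reports "no effect" while four of the five questions moved by 60 points.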
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates that techniques which appear neutral in aggregate can be actively counterproductive on specific questions |
| H2 | Partially supports | The aggregate view does show techniques "working" — it's only at the per-question level that harm becomes visible |
| H3 | Strongly supports | This is the core evidence for H3 — effectiveness is contingent on the specific question, model, and measurement threshold |
## Context
This finding has profound implications for how prompt engineering advice is evaluated. Most popular guides test techniques with a handful of examples and report whether they "worked." The Wharton methodology reveals that this approach is fundamentally inadequate — you need hundreds of repetitions with multiple thresholds to detect the variability that single-attempt testing hides.
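The repetition-and-threshold methodology can be sketched as follows. This is a minimal simulation under assumed conditions (each question has a latent pass probability; a real evaluation would query the model 100 times per question), so the function name and numbers are illustrative, not the Wharton implementation.

```python
import random

random.seed(0)

def accuracy_at_threshold(pass_counts, n_reps, threshold):
    """Fraction of questions answered correctly in at least
    `threshold` share of the repetitions."""
    return sum(c / n_reps >= threshold for c in pass_counts) / len(pass_counts)

n_reps = 100
# Simulate 200 questions, each with a latent per-attempt pass probability.
pass_probs = [random.random() for _ in range(200)]
pass_counts = [sum(random.random() < p for _ in range(n_reps))
               for p in pass_probs]

for t in (0.5, 0.9, 1.0):
    print(f"accuracy at {t:.0%} threshold: "
          f"{accuracy_at_threshold(pass_counts, n_reps, t):.2%}")
```

Raising the threshold can only lower the reported score, which is why the strictest criterion (correct on 100 of 100 repetitions) yields a much lower figure than a single-attempt or majority-vote measure on the same runs.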