R0023/2026-03-25/Q001/SRC05/E01
Prompt tweaks produce accuracy swings of up to 60 percentage points on individual questions; these swings average out in aggregate, masking per-question variability.
URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/
## Extract
At the strictest threshold (100% accuracy across 100 repetitions), GPT-4o performed barely better than random guessing (30.28%), while at lower thresholds it reached 47.54%. Saying "Please" versus "I order you" produced performance swings up to 60 percentage points on individual questions, yet these differences "balance out across the full dataset."
This means aggregate benchmarks mask critical per-question variability. A prompt technique that appears neutral in aggregate may simultaneously be helping some questions and harming others. Single-attempt testing obscures reliability issues that matter for high-stakes deployment.
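The masking effect can be illustrated with a small synthetic sketch. The per-question pass rates below are hypothetical, not the Wharton data: two prompt variants have identical aggregate accuracy even though individual questions swing by 60 points in opposite directions.

```python
# Hypothetical per-question pass rates (fraction correct over 100
# repetitions) for two prompt variants on five questions.
# Numbers are illustrative only, not the Wharton results.
polite  = [0.95, 0.20, 0.70, 0.90, 0.25]  # "Please ..."
command = [0.35, 0.80, 0.70, 0.30, 0.85]  # "I order you ..."

aggregate_polite = sum(polite) / len(polite)
aggregate_command = sum(command) / len(command)
per_question_swings = [abs(p - c) for p, c in zip(polite, command)]

print(f"aggregate (polite):     {aggregate_polite:.2f}")   # 0.60
print(f"aggregate (command):    {aggregate_command:.2f}")  # 0.60
print(f"max per-question swing: {max(per_question_swings):.2f}")  # 0.60
```

Both variants score 60% in aggregate, so a benchmark-level comparison reports "no effect" while four of the five questions moved by 60 points.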
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates that techniques which appear neutral in aggregate can be actively counterproductive on specific questions |
| H2 | Partially supports | The aggregate view does show techniques "working" — it's only at the per-question level that harm becomes visible |
| H3 | Strongly supports | This is the core evidence for H3 — effectiveness is contingent on the specific question, model, and measurement threshold |
## Context
This finding has profound implications for how prompt engineering advice is evaluated. Most popular guides test techniques with a handful of examples and report whether they "worked." The Wharton methodology reveals that this approach is fundamentally inadequate — you need hundreds of repetitions with multiple thresholds to detect the variability that single-attempt testing hides.
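The repetition-and-threshold methodology can be sketched as follows. This is a minimal simulation under assumed conditions (each question has a latent pass probability; a real evaluation would query the model 100 times per question), so the function name and numbers are illustrative, not the Wharton implementation.

```python
import random

random.seed(0)

def accuracy_at_threshold(pass_counts, n_reps, threshold):
    """Fraction of questions answered correctly in at least
    `threshold` share of the repetitions."""
    return sum(c / n_reps >= threshold for c in pass_counts) / len(pass_counts)

n_reps = 100
# Simulate 200 questions, each with a latent per-attempt pass probability.
pass_probs = [random.random() for _ in range(200)]
pass_counts = [sum(random.random() < p for _ in range(n_reps))
               for p in pass_probs]

for t in (0.5, 0.9, 1.0):
    print(f"accuracy at {t:.0%} threshold: "
          f"{accuracy_at_threshold(pass_counts, n_reps, t):.2%}")
```

Raising the threshold can only lower the reported score, which is why the strictest criterion (correct on 100 of 100 repetitions) yields a much lower figure than a single-attempt or majority-vote measure on the same runs.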