

Research R0023 — Counterproductive advice and prompt lifecycle
Run 2026-03-25
Query Q001
Source SRC05
Evidence SRC05-E01
Type Statistical

Prompt tweaks produce accuracy swings of up to 60 percentage points on individual questions that average out in aggregate, masking critical per-question variability.

URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/

Extract

At the strictest threshold (100% accuracy across 100 repetitions), GPT-4o performed barely better than random guessing (30.28%), while at lower thresholds it reached 47.54%. Saying "Please" versus "I order you" produced performance swings of up to 60 percentage points on individual questions, yet these differences "balance out across the full dataset."

This means aggregate benchmarks mask critical per-question variability. A prompt technique that appears neutral in aggregate may be simultaneously helping some questions and harming others. Single-attempt testing methods obscure reliability issues critical for high-stakes deployment.
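To make the masking effect concrete, here is a toy Python sketch; the per-question numbers are invented for illustration and are not the Wharton data. It shows how 60-point swings in opposite directions on different questions can net out to zero in the aggregate:

```python
# Toy illustration with made-up numbers (not the Wharton data): per-question
# accuracy (%) over repeated runs for two phrasings of the same prompt.
polite  = {"q1": 90, "q2": 20, "q3": 55, "q4": 75}   # "Please ..."
command = {"q1": 30, "q2": 80, "q3": 55, "q4": 75}   # "I order you ..."

# Per-question effect of switching phrasings: large swings in both directions.
deltas = {q: polite[q] - command[q] for q in polite}
print(deltas)            # {'q1': 60, 'q2': -60, 'q3': 0, 'q4': 0}

# Aggregate effect: the swings cancel, so the technique looks "neutral".
aggregate_delta = sum(deltas.values()) / len(deltas)
print(aggregate_delta)   # 0.0
```

A single-number benchmark score computed this way would report no difference between the two phrasings, even though each one is strongly helping some questions and strongly harming others.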

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Supports | Demonstrates that techniques which appear neutral in aggregate can be actively counterproductive on specific questions
H2 | Partially supports | The aggregate view does show techniques "working"; it is only at the per-question level that harm becomes visible
H3 | Strongly supports | This is the core evidence for H3: effectiveness is contingent on the specific question, model, and measurement threshold

Context

This finding has profound implications for how prompt engineering advice is evaluated. Most popular guides test techniques with a handful of examples and report whether they "worked." The Wharton methodology shows that this approach is fundamentally inadequate: detecting the variability that single-attempt testing hides requires many repetitions per question (the report uses 100), scored at multiple accuracy thresholds.
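A minimal Python sketch of this kind of evaluation, assuming a hypothetical ask_model call and made-up per-question success rates; it only illustrates repetition-plus-threshold scoring and does not reproduce the report's code:

```python
import random

random.seed(0)

def ask_model(question: str) -> bool:
    """Stand-in for a real model call; returns whether one attempt was correct.
    Hypothetical per-question success rates, invented for illustration."""
    per_question_rate = {"q1": 0.99, "q2": 0.70, "q3": 0.45}
    return random.random() < per_question_rate[question]

def per_question_accuracy(question: str, repetitions: int = 100) -> float:
    """Fraction of repetitions answered correctly for one question."""
    return sum(ask_model(question) for _ in range(repetitions)) / repetitions

questions = ["q1", "q2", "q3"]
accuracies = {q: per_question_accuracy(q) for q in questions}

# Score the benchmark at several thresholds: a question only counts as solved
# if its accuracy across all repetitions meets the threshold. The strictest
# threshold (1.0 = correct on every repetition) yields a much lower score
# than lenient ones, mirroring the 30.28% vs 47.54% gap described above.
for threshold in (1.0, 0.9, 0.51):
    score = sum(acc >= threshold for acc in accuracies.values()) / len(questions)
    print(f"threshold {threshold:.2f}: benchmark score {score:.2%}")
```

Run per prompt variant, this kind of harness exposes both the per-question swings and the sensitivity of the headline number to the chosen threshold, neither of which a single-attempt test can reveal.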