Research R0023 — Counterproductive advice and prompt lifecycle
Run 2026-03-25
Query Q003
Source SRC02
Evidence SRC02-E01
Type Statistical

The same model with the same prompt produces inconsistent results across repetitions; this baseline variability makes genuine degradation hard to detect.

URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/

Extract

The same model with identical prompts produced inconsistent answers across 100 repetitions. At the strictest threshold (100% accuracy), GPT-4o performed at 30.28% — barely above chance. This means single-attempt comparisons between model versions are unreliable for detecting degradation. The signal-to-noise ratio is low, and many reported cases of "prompt degradation" may actually be normal stochastic variation.
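The extract's statistical point can be illustrated with a quick simulation (a minimal sketch; the 0.30 per-run accuracy is an assumption loosely based on the reported 30.28% figure, and everything else here is illustrative rather than taken from the source):

```python
import random

def run_trials(p_correct: float, n: int, rng: random.Random) -> float:
    """Simulate n independent runs of a stochastic model whose per-run
    probability of a correct answer is p_correct; return observed accuracy."""
    return sum(rng.random() < p_correct for _ in range(n)) / n

rng = random.Random(0)
p = 0.30  # assumed per-run accuracy, loosely based on the 30.28% figure

# Single-attempt "comparison": two runs of the SAME model often disagree,
# so a one-shot before/after check cannot distinguish degradation from noise.
single_a = run_trials(p, 1, rng)
single_b = run_trials(p, 1, rng)

# With 100 repetitions the estimates tighten: the standard error of an
# accuracy estimate is sqrt(p * (1 - p) / n), roughly 0.046 at n=100
# versus roughly 0.46 at n=1.
n = 100
est_a = run_trials(p, n, rng)
est_b = run_trials(p, n, rng)
print(single_a, single_b)  # each is 0.0 or 1.0 by construction: pure noise
print(est_a, est_b)        # both land near 0.30
```

The takeaway matches the extract: a single attempt per model version carries far too much variance to detect a real shift in accuracy, and repeated sampling shrinks the noise roughly in proportion to the square root of the number of repetitions.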

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Partially contradicts | Some reported degradation may be normal variation, not true degradation
H2 | Supports | Reinforces the view that practitioner reports may be anecdotal noise
H3 | Supports | Adds the dimension that stochastic variation complicates the degradation picture