R0023/2026-03-25/Q003/SRC02/E01
The same model with the same prompt produces inconsistent results across repetitions; this baseline variability makes degradation hard to detect.
URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/
Extract
The same model with identical prompts produced inconsistent answers across 100 repetitions. At the strictest threshold (100% accuracy), GPT-4o performed at 30.28%, barely above chance. This means single-attempt comparisons between model versions are unreliable for detecting degradation. The signal-to-noise ratio is low, and many reported cases of "prompt degradation" may actually be normal stochastic variation.
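The point that single attempts cannot separate degradation from noise can be made concrete with a repeated-sampling sketch. This is not from the report: the true accuracy, repetition count, and the two-proportion z-test helper are illustrative assumptions, using only the Python standard library.

```python
import math
import random

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference in success rates (normal approximation)."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= z) for standard normal

random.seed(0)
TRUE_ACC = 0.55   # hypothetical: both versions share the same true accuracy
REPS = 100        # repetitions per version, mirroring the report's 100 runs

# Each run either answers correctly (prob TRUE_ACC) or not.
old = sum(random.random() < TRUE_ACC for _ in range(REPS))
new = sum(random.random() < TRUE_ACC for _ in range(REPS))

# A single attempt per version is one coin flip each: any observed "degradation"
# is pure noise. With 100 repetitions per version, a two-proportion test can
# quantify whether an observed accuracy gap exceeds stochastic variation.
p = two_proportion_z(old, REPS, new, REPS)
print(f"old={old}/{REPS} new={new}/{REPS} p={p:.3f}")
```

A large p-value here means the observed gap between "versions" is consistent with baseline variability, which is exactly the failure mode the extract warns that anecdotal single-run comparisons cannot rule out.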
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Partially contradicts | Some reported degradation may be normal variation, not true degradation |
| H2 | Supports | Reinforces the view that practitioner reports may be anecdotal noise |
| H3 | Supports | Adds the dimension that stochastic variation complicates the degradation picture |