R0023/2026-03-25/Q003/SRC02/E01
The same model with the same prompt produces inconsistent results across repetitions; this baseline variability makes degradation hard to detect.
URL: https://gail.wharton.upenn.edu/research-and-insights/tech-report-prompt-engineering-is-complicated-and-contingent/
Extract
The same model with identical prompts produced inconsistent answers across 100 repetitions. At the strictest threshold (100% accuracy), GPT-4o performed at 30.28%, barely above chance. This means single-attempt comparisons between model versions are unreliable for detecting degradation. The signal-to-noise ratio is low, and many reported cases of "prompt degradation" may actually be normal stochastic variation.
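The point that single attempts cannot separate degradation from noise can be made concrete with a repeated-sampling sketch. This is not from the report: the true accuracy, repetition count, and the two-proportion z-test helper are illustrative assumptions, using only the Python standard library.

```python
import math
import random

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference in success rates (normal approximation)."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= z) for standard normal

random.seed(0)
TRUE_ACC = 0.55   # hypothetical: both versions share the same true accuracy
REPS = 100        # repetitions per version, mirroring the report's 100 runs

# Each run either answers correctly (prob TRUE_ACC) or not.
old = sum(random.random() < TRUE_ACC for _ in range(REPS))
new = sum(random.random() < TRUE_ACC for _ in range(REPS))

# A single attempt per version is one coin flip each: any observed "degradation"
# is pure noise. With 100 repetitions per version, a two-proportion test can
# quantify whether an observed accuracy gap exceeds stochastic variation.
p = two_proportion_z(old, REPS, new, REPS)
print(f"old={old}/{REPS} new={new}/{REPS} p={p:.3f}")
```

A large p-value here means the observed gap between "versions" is consistent with baseline variability, which is exactly the failure mode the extract warns that anecdotal single-run comparisons cannot rule out.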
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Partially contradicts | Some reported degradation may be normal variation, not true degradation |
| H2 | Supports | Reinforces the view that practitioner reports may be anecdotal noise |
| H3 | Supports | Adds the dimension that stochastic variation complicates the degradation picture |