R0020/2026-03-25/Q001/H3
Statement
Testing tools and methodologies for AI prompts exist, but the field is nascent, with significant gaps between traditional testing rigor and current prompt evaluation capabilities. Key challenges, including non-determinism, lack of standardized metrics, and reliance on subjective judgment, remain unsolved.
Status
Current: Supported
This hypothesis best fits the evidence. Tools exist and are actively developed, but the field explicitly acknowledges fundamental unsolved challenges. The shift "from vibes to verified metrics" is described as an ongoing transition, not a completed one.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E02 | "LLM outputs are subjective and non-deterministic...there often isn't a single 'right' answer" |
| SRC02-E02 | Production gap: prompts that work in testing produce inconsistent outputs in production |
| SRC04-E01 | LLM judges introduce variability; teams must run 3-5 trials per case to distinguish signal from noise |
| SRC03-E02 | Output unpredictability, model dependency, and scalability cited as key challenges |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | The existence of six dedicated frameworks with CI/CD integration suggests movement toward maturity |
Reasoning
The evidence converges on a clear picture: the tooling exists, the need is recognized, and active development is underway. However, every source that describes the tools also acknowledges fundamental limitations. Non-determinism is the root challenge: because the same prompt can yield different outputs across runs, prompt testing cannot achieve the deterministic pass/fail guarantees of traditional software testing. The field is evolving toward statistical and heuristic approaches (golden datasets, LLM-as-judge, regression testing with confidence intervals) that represent a pragmatic adaptation rather than a mature solution.
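To make the statistical framing concrete, the following is a minimal sketch of repeated-trial evaluation with a confidence interval on the pass rate. It is not taken from any of the cited sources or frameworks. `run_case` is a hypothetical callable standing in for "send the prompt for this case and grade the output", the Wilson score interval is one reasonable interval choice among several, and pooling all trials as if they were independent is a simplifying assumption.

```python
import math
from typing import Callable, Sequence


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def pass_rate_report(
    run_case: Callable[[str], bool],  # hypothetical: runs one trial, returns pass/fail
    cases: Sequence[str],
    trials_per_case: int = 5,  # SRC04-E01 suggests 3-5 trials per case
) -> dict[str, float]:
    """Run every case several times and summarize the pooled pass rate."""
    passes = 0
    total = 0
    for case in cases:
        for _ in range(trials_per_case):
            passes += bool(run_case(case))
            total += 1
    low, high = wilson_interval(passes, total)
    return {"pass_rate": passes / total, "ci_low": low, "ci_high": high, "trials": total}
```

Under these assumptions, a regression check could compare `ci_low` for a new prompt version against an agreed threshold instead of asserting a single pass/fail result. This replaces a deterministic guarantee with a statistical one, which is the adaptation the paragraph above describes.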
Relationship to Other Hypotheses
H3 subsumes the valid parts of H1 (tools exist) while acknowledging the limitations that prevent full H1 support. H2 is eliminated. H3 represents the nuanced middle ground that the evidence most strongly supports.