R0020/2026-03-25/Q001/SRC04/E01

Research R0020 — Prompt Engineering Gaps
Run 2026-03-25
Query Q001
Source SRC04
Evidence SRC04-E01
Type Analytical

Evaluation methodology: golden datasets, LLM-as-judge, regression testing with noise mitigation

URL: https://www.braintrust.dev/articles/what-is-prompt-evaluation

Extract

Prompt evaluation methodology components:

Golden datasets: Teams curate 20-50 test cases paired with expected outputs, drawn from production logs to capture real user behavior rather than synthetic examples.
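A golden dataset of this size is small enough to keep as a plain list of input/expected pairs. A minimal sketch, with illustrative field names and cases (not from any specific tool):

```python
# Hand-curated input/expected pairs, ideally sampled from production logs.
golden_dataset = [
    {"input": "What is your refund policy?",
     "expected": "Refunds are available within 30 days of purchase."},
    {"input": "How do I reset my password?",
     "expected": "Use the 'Forgot password' link on the sign-in page."},
]

def validate_golden_dataset(cases):
    """Check that each case carries the fields an evaluator will need."""
    for case in cases:
        if "input" not in case or "expected" not in case:
            raise ValueError(f"malformed case: {case}")
    return cases
```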

LLM-as-judge scoring: A capable model evaluates outputs using the original input, defined criteria, and optionally reference answers, returning structured scores with reasoning. This addresses limitations of traditional string-matching metrics like BLEU or ROUGE.
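The judge pattern reduces to two pieces: assembling the context the judge model sees, and parsing its structured verdict. A hedged sketch (prompt wording, JSON schema, and function names are assumptions, not a documented API):

```python
import json

def build_judge_prompt(question, answer, criteria, reference=None):
    """Assemble the judge's context: original input, criteria, optional reference."""
    parts = [
        f"Question: {question}",
        f"Candidate answer: {answer}",
        f"Criteria: {criteria}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append('Respond as JSON: {"score": <0 to 1>, "reasoning": "<why>"}')
    return "\n".join(parts)

def parse_judge_response(raw):
    """Parse the judge's structured score and reasoning; fail loudly if malformed."""
    verdict = json.loads(raw)
    return float(verdict["score"]), verdict["reasoning"]
```

The structured score-plus-reasoning output is what lets this replace string-matching metrics: the judge can credit a correct answer phrased differently from the reference, which BLEU or ROUGE would penalize.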

Noise mitigation: "LLM judges introduce some variability, small score differences between prompt versions can reflect noise rather than real improvement." Teams mitigate by running 3-5 trials per test case and computing confidence intervals.
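The trials-plus-confidence-interval step can be sketched with the standard library. A minimal version, assuming a normal approximation (the 1.96 factor) over per-trial judge scores:

```python
import statistics

def score_with_confidence(scores):
    """Mean score plus an approximate 95% confidence half-width over repeated trials."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, 0.0
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width

def is_real_improvement(new_scores, old_scores):
    """Treat a gain as real only if the two confidence intervals do not overlap."""
    new_mean, new_hw = score_with_confidence(new_scores)
    old_mean, old_hw = score_with_confidence(old_scores)
    return new_mean - new_hw > old_mean + old_hw
```

With 3-5 trials the intervals are wide, so only clear improvements pass; a small mean difference with overlapping intervals is treated as noise.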

Regression testing: Integration into CI/CD pipelines with clear thresholds (e.g., factuality >= 0.85), comparing new versions against production baselines and blocking merges that don't meet standards.
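A CI gate of this shape is a few lines: check each metric against its absolute floor and against the production baseline, and block the merge on any failure. A sketch with illustrative threshold values (only the factuality >= 0.85 figure comes from the source):

```python
# Illustrative per-metric floors; only factuality's 0.85 is from the article.
THRESHOLDS = {"factuality": 0.85, "relevance": 0.80}

def gate_merge(new_scores, baseline_scores, thresholds=THRESHOLDS):
    """Return a list of failures; an empty list means the merge may proceed."""
    failures = []
    for metric, floor in thresholds.items():
        score = new_scores.get(metric, 0.0)
        if score < floor:
            failures.append(f"{metric} {score:.2f} below threshold {floor:.2f}")
        elif score < baseline_scores.get(metric, 0.0):
            failures.append(f"{metric} regressed vs production baseline")
    return failures
```

In CI this would run after the evaluation job, with a nonzero exit code (and the failure list in the log) blocking the merge.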

Quality dimensions: Correctness, groundedness, relevance, style/format adherence, safety, latency, and cost.

Relevance to Hypotheses

Hypothesis | Relationship | Rationale
H1 | Supports | Regression testing with CI/CD integration shows maturity in some areas
H2 | Contradicts | Structured, quantitative evaluation methodology exists
H3 | Supports | The need for 3-5 trials and confidence intervals to distinguish signal from noise demonstrates fundamental limitations

Context

The 20-50 golden test case recommendation is strikingly small compared to traditional software test suites (which may have thousands). The need for multiple trials per test case and confidence intervals reflects a fundamentally statistical approach to testing — more akin to experimental science than software QA. This is perhaps the most diagnostic evidence for the H3 hypothesis: tools exist, but they operate under constraints that traditional testing does not face.