R0020/2026-03-25/Q001/SRC04/E01
Evaluation methodology: golden datasets, LLM-as-judge, regression testing with noise mitigation
URL: https://www.braintrust.dev/articles/what-is-prompt-evaluation
Extract
Prompt evaluation methodology components:
Golden datasets: Teams curate 20-50 test cases paired with expected outputs, drawn from production logs to capture real user behavior rather than synthetic examples.
LLM-as-judge scoring: A capable model evaluates outputs using the original input, defined criteria, and optionally reference answers, returning structured scores with reasoning. This addresses limitations of traditional string-matching metrics like BLEU or ROUGE.
Noise mitigation: "LLM judges introduce some variability, small score differences between prompt versions can reflect noise rather than real improvement." Teams mitigate this by running 3-5 trials per test case and computing confidence intervals over the resulting scores.
Regression testing: Integration into CI/CD pipelines with clear thresholds (e.g., factuality >= 0.85), comparing new versions against production baselines and blocking merges that don't meet standards.
Quality dimensions: Correctness, groundedness, relevance, style/format adherence, safety, latency, and cost.
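
The noise-mitigation and regression-testing components above can be sketched together. This is a minimal illustration, not the article's implementation: the source does not say which interval teams compute, so a 95% Student-t interval is assumed, and each score is assumed to be one trial's aggregate factuality score over the golden set. The names `confidence_interval`, `intervals_overlap`, and `passes_gate`, and the gate policy itself, are hypothetical; only the 0.85 threshold comes from the extract.

```python
import math
import statistics
from dataclasses import dataclass

# Two-sided 95% Student-t critical values, keyed by degrees of freedom
# (number of trials minus one). Covers the 3-5 trial range only.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


@dataclass
class Interval:
    mean: float
    low: float
    high: float


def confidence_interval(trial_scores: list[float]) -> Interval:
    """Return a 95% t-interval for the mean of 3-5 trial scores."""
    n = len(trial_scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("expected between 3 and 5 trial scores")
    mean = statistics.mean(trial_scores)
    half_width = T_CRIT_95[n - 1] * statistics.stdev(trial_scores) / math.sqrt(n)
    return Interval(mean, mean - half_width, mean + half_width)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when the two intervals share any point (difference may be noise)."""
    return a.low <= b.high and b.low <= a.high


def passes_gate(candidate: Interval, baseline: Interval, threshold: float = 0.85) -> bool:
    """Hypothetical merge gate (assumed policy): the candidate's mean must meet
    the threshold, and its interval must not sit entirely below the baseline's."""
    clearly_worse = candidate.high < baseline.low
    return candidate.mean >= threshold and not clearly_worse


if __name__ == "__main__":
    # Made-up factuality scores, one per trial, for illustration only.
    baseline = confidence_interval([0.87, 0.85, 0.88, 0.86, 0.84])
    candidate = confidence_interval([0.90, 0.88, 0.92, 0.86, 0.89])
    print("baseline:", baseline)
    print("candidate:", candidate)
    print("difference may be noise:", intervals_overlap(baseline, candidate))
    print("merge allowed:", passes_gate(candidate, baseline))
```

With these made-up scores the candidate's mean is 0.03 higher than the baseline's, yet the two intervals overlap, so the gain cannot be separated from judge noise on five trials alone. Interval overlap is a deliberately conservative heuristic; a paired comparison per test case would be a reasonable alternative.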
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Regression testing with CI/CD integration shows maturity in some areas |
| H2 | Contradicts | Structured, quantitative evaluation methodology exists |
| H3 | Supports | The need for 3-5 trials and confidence intervals to distinguish signal from noise demonstrates fundamental limitations |
Context
The recommended 20-50 golden test cases is strikingly small next to traditional software test suites, which may contain thousands of tests. The need for multiple trials per test case and for confidence intervals reflects a fundamentally statistical approach to testing, closer to experimental science than to software QA. This is perhaps the most diagnostic evidence for H3: tools exist, but they operate under constraints that traditional testing does not face.
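
To make the statistical framing concrete, one common choice is the 95% Student-t interval for the mean of n trial scores with sample mean x̄ and sample standard deviation s. This is an assumption for illustration; the source does not specify which interval teams compute. At the low trial counts recommended, the interval is wide relative to the spread of the scores:

```latex
\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad
n = 5:\ \frac{2.776}{\sqrt{5}}\,s \approx 1.24\,s,
\qquad
n = 3:\ \frac{4.303}{\sqrt{3}}\,s \approx 2.48\,s
```

Under this assumption, three trials give an interval roughly twice as wide as five trials for the same score spread, which is why small differences between prompt versions are hard to distinguish from noise.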