R0020/2026-03-25/Q001 — Self-Audit
ROBIS 4-Domain Audit
Domain 1: Eligibility Criteria
Rating: Low risk
| Criterion | Assessment |
|---|---|
| Evidence types defined before searching | Yes — industry publications, framework documentation, and methodology guides targeted |
| Criteria consistent throughout | Yes — same relevance and reliability standards applied to all sources |
| Scope maintained | Yes — focused on prompt testing frameworks and methodologies throughout |
Notes: Eligibility criteria were stable. The only deviation was rejecting results about using prompts for software testing (the inverse of the query), which was an appropriate scope refinement.
Domain 2: Search Comprehensiveness
Rating: Some concerns
| Criterion | Assessment |
|---|---|
| Multiple search strategies used | Yes — three distinct searches with different query terms |
| Searches designed to test each hypothesis | Partial — searches were designed to find frameworks (H1/H3) but no specific search targeted evidence against framework existence (H2) |
| All results dispositioned | Yes — 30 results returned, all dispositioned |
| Source diversity achieved | Partial — all sources are industry/vendor publications; no academic or peer-reviewed sources found |
Notes: The absence of academic sources is a genuine gap. No search specifically targeted academic databases or peer-reviewed research on prompt testing methodology. This limits the evidence base to vendor and industry perspectives.
Domain 3: Evaluation Consistency
Rating: Low risk
| Criterion | Assessment |
|---|---|
| All sources scored using same framework | Yes — same reliability/relevance/bias dimensions applied |
| Evidence typed consistently | Yes — Reported, Analytical, and Factual types applied consistently |
| ACH matrix applied | Yes — all evidence mapped to all hypotheses |
| Diagnosticity analysis performed | Yes — most and least diagnostic evidence identified |
Notes: The evaluation framework was applied consistently across all sources.
Domain 4: Synthesis Fairness
Rating: Low risk
| Criterion | Assessment |
|---|---|
| All hypotheses given fair hearing | Yes — H1 received partial support, H2 was tested and eliminated, H3 received strongest support |
| Contradictory evidence surfaced | Yes — limitations and challenges were prominently documented alongside tool existence |
| Confidence calibrated to evidence | Yes — Medium confidence reflects the vendor-only evidence base |
| Gaps acknowledged | Yes — absence of academic sources and longitudinal data explicitly noted |
Notes: The synthesis appropriately balances the existence of tools against their acknowledged limitations.
Overall Assessment
Overall risk of bias: Low risk
The primary limitation is the vendor-dominated evidence base (Domain 2 concern), which is inherent to the subject matter — prompt testing frameworks are industry products documented by industry sources. The absence of academic evaluation is itself a finding that supports H3.
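The relationship between the domain ratings and the single concern called out above can be tabulated in a few lines. This is only an illustrative sketch: the ratings are copied from the four domains in this audit, while the dictionary layout and the `flagged_domains` helper are hypothetical names introduced here. The overall rating remains a judgment call and is not computed by this code.

```python
# Domain ratings copied from this audit. The data structure and helper
# below are illustrative assumptions, not part of the audit method itself.
DOMAIN_RATINGS = {
    "Eligibility Criteria": "Low risk",
    "Search Comprehensiveness": "Some concerns",
    "Evaluation Consistency": "Low risk",
    "Synthesis Fairness": "Low risk",
}


def flagged_domains(ratings):
    """Return the domains whose rating is anything other than 'Low risk'."""
    return [name for name, rating in ratings.items() if rating != "Low risk"]


if __name__ == "__main__":
    # List each domain that needs to be discussed in the overall assessment.
    for name in flagged_domains(DOMAIN_RATINGS):
        print(f"{name}: {DOMAIN_RATINGS[name]}")
```

Listing the flagged domains separately keeps the one area of concern visible even when the overall judgment is "Low risk".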
Researcher Bias Check
- Confirmation bias risk: Low. The query framing ("how is this tested?") could lead to overemphasis on testing solutions, but the research surfaced limitations and challenges prominently.
- Availability bias risk: Some concerns. All sources are web-accessible vendor publications, which may overrepresent marketed tools and underrepresent internal or proprietary testing approaches used by major AI labs.