R0002/2026-03-13
12 claims investigated. Full entity decomposition with search logs, source scorecards, evidence extracts, ACH matrices, and self-audits.
Claims
C001 — ICD 203 tradecraft standards and probability scale — Likely
Claim: ICD 203 defines nine tradecraft standards (Sourcing, Uncertainty, Distinction, Alternatives, Relevance, Logic, Change, Accuracy, Visual integrity). Seven-point probability scale from Remote (01–05%) through Almost certain (95–99%).
Verdict: Structurally correct, label imprecisions
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Inconclusive | — |
| H2: Correct structure, inaccurate labels | Supported | Likely (55–80%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 6
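The "label imprecisions" in the verdict are relative to the official band wording; for reference, the seven ICD 203 bands and their percentage edges can be written as a lookup (labels and edges below are the published ICD 203 ones; boundary percentages are shared between adjacent bands, so this sketch returns the first match):

```python
# ICD 203 seven-band probability scale. The claim's "Remote" corresponds
# to ICD 203's "almost no chance / remote" band.
ICD203_BANDS = [
    (1, 5, "almost no chance / remote"),
    (5, 20, "very unlikely / highly improbable"),
    (20, 45, "unlikely / improbable"),
    (45, 55, "roughly even chance / roughly even odds"),
    (55, 80, "likely / probable"),
    (80, 95, "very likely / highly probable"),
    (95, 99, "almost certain / nearly certain"),
]

def icd203_label(percent: float) -> str:
    """Map a probability (in percent) to its ICD 203 verbal band.

    Band edges overlap (5, 20, 45, ...); a shared boundary resolves
    to the lower band because the first match wins.
    """
    for low, high, label in ICD203_BANDS:
        if low <= percent <= high:
            return label
    raise ValueError(f"{percent}% is outside the 1-99% scale")
```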
C002 — GRADE framework for rating evidence quality — Almost certain
Claim: GRADE's core insight is that evidence quality and recommendation strength are independent axes. Four certainty levels, five downgrade criteria, three upgrade criteria.
Verdict: Fully confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 4
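The claim's counts map onto GRADE's additive structure: certainty starts at a level set by study design, the five downgrade criteria move it down, the three upgrade criteria move it up, and the result is clamped to the four-level scale. A simplified sketch (real GRADE judgments involve discretion and are not purely additive; this function only illustrates the structure):

```python
LEVELS = ["Very low", "Low", "Moderate", "High"]  # GRADE's four certainty levels

DOWNGRADE = {"risk of bias", "inconsistency", "indirectness",
             "imprecision", "publication bias"}          # five downgrade criteria
UPGRADE = {"large effect", "dose-response gradient",
           "plausible residual confounding"}             # three upgrade criteria

def grade_certainty(randomized: bool, downgrades: dict, upgrades: dict) -> str:
    """Simplified GRADE tally. The dicts map criterion -> levels moved (1 or 2)."""
    assert set(downgrades) <= DOWNGRADE and set(upgrades) <= UPGRADE
    start = 3 if randomized else 1   # RCTs start High, observational evidence starts Low
    score = start - sum(downgrades.values()) + sum(upgrades.values())
    return LEVELS[max(0, min(3, score))]
```

For example, a randomized body of evidence downgraded one level for imprecision lands at Moderate: `grade_certainty(True, {"imprecision": 1}, {})`.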
C003 — IPCC calibrated uncertainty language — Likely
Claim: IPCC two-axis confidence model: Evidence quality (Limited, Medium, Robust) × Source agreement (Low, Medium, High), five confidence levels. Separate nine-point likelihood scale.
Verdict: Confirmed, count debatable
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Inconclusive | — |
| H2: Framework correct, count differs | Supported | Likely (55–80%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 5
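The two axes form a 3×3 grid that authors map onto five confidence levels. The IPCC guidance presents that mapping as a judgment aid, not a formula; the summed-index heuristic below is this sketch's own assumption, shown only to make the grid concrete:

```python
EVIDENCE = ["limited", "medium", "robust"]
AGREEMENT = ["low", "medium", "high"]
CONFIDENCE = ["very low", "low", "medium", "high", "very high"]  # five levels

def confidence_heuristic(evidence: str, agreement: str) -> str:
    """Illustrative only: sum the two axis indices (0-2 each) to pick one
    of the five confidence levels. IPCC guidance leaves the exact
    cell-to-level assignment to author judgment."""
    i = EVIDENCE.index(evidence) + AGREEMENT.index(agreement)  # 0..4
    return CONFIDENCE[i]
```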
C004 — PRISMA checklist and Mulrow 1987 — Almost certain
Claim: PRISMA exists because systematic reviews had abysmal reporting quality. Mulrow 1987 documented most reviews failed basic criteria.
Verdict: Fully confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 5
C005 — Cochrane RoB 2 five domains of bias — Very likely
Claim: Cochrane's RoB 2 has five bias domains. COI/funding conspicuously absent.
Verdict: Confirmed, COI note
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Five domains, COI absent | Supported | Very likely (80–95%) |
| H2: Five domains, COI addressed elsewhere | Inconclusive | — |
| H3: Incorrect domain count | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 4
C006 — Chamberlin multiple hypotheses / Platt strong inference — Very likely
Claim: Chamberlin published the method of multiple working hypotheses in 1890/1897; Platt's 1964 "strong inference" paper cites Chamberlin. Includes the "parental affection" metaphor, the recurring step "1'", and a Baconian method reference.
Verdict: Confirmed, Baconian attribution nuance
| Hypothesis | Status | Probability |
|---|---|---|
| H1: All elements accurate | Supported | Very likely (80–95%) |
| H2: Elements confirmed, attribution nuanced | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 6 · Searches: 10
C007 — CONSORT 25-item checklist — Likely
Claim: CONSORT is a 25-item checklist for reporting randomized controlled trials.
Verdict: Accurate for 2010, outdated by 2025
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate as stated | Inconclusive | — |
| H2: Correct for 2010, superseded | Supported | Likely (70%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 4 · Searches: 3
C008 — ROBIS four domains of bias — Almost certain
Claim: ROBIS assesses four domains of bias in systematic reviews.
Verdict: Confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Four domains confirmed | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Incorrect domain count | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 2
C009 — NAS 21 standards with 82 elements — Almost certain
Claim: NAS published 21 standards with 82 elements across four stages for systematic reviews.
Verdict: Confirmed, publisher attribution note
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Correct counts, attribution nuance | Inconclusive | — |
| H3: Incorrect counts | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 3
C010 — No published systematic combination — Likely
Claim: No one has published a systematic combination of IC analytical frameworks with scientific research methodology frameworks into a unified, machine-executable prompt.
Verdict: Not refuted, incompletely verified
| Hypothesis | Status | Probability |
|---|---|---|
| H1: No such combination exists | Supported | Likely (55–80%) |
| H2: Partial combinations exist | Inconclusive | — |
| H3: Full combination exists | Eliminated | — |
Confidence: Medium · Sources: 2 · Searches: 4
C011 — Journalism principles-based, not methodology-based — Likely
Claim: Journalism and fact-checking are principles-based, not methodology-based. They lack formal evidence hierarchies, calibrated uncertainty scales, and structured bias assessment domains.
Verdict: Partially confirmed, oversimplified
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Purely principles-based | Inconclusive | — |
| H2: Principles-based with emerging methodology | Supported | Likely (65–75%) |
| H3: Fully methodology-based | Eliminated | — |
Confidence: Medium · Sources: 5 · Searches: 5
C012 — Wardle/Derakhshan information disorder taxonomy — Very likely
Claim: Wardle and Derakhshan published an information disorder taxonomy in 2017 through the Council of Europe, distinguishing misinformation, disinformation, and malinformation based on intent to harm.
Verdict: Confirmed, two-dimensional clarification needed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate as stated | Supported | Very likely (80–85%) |
| H2: Correct framework, simplification | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 5 · Searches: 3
Collection Analysis
The sections below analyze the collection of claims as a whole, after all individual investigations were complete. These findings are properties of the research run, not of any single claim.
Cross-Cutting Patterns
| Pattern | Claims Affected | Significance |
|---|---|---|
| Precision downgrades | C001, C003, C007, C012 | Claims that are structurally correct but imprecise in count or labeling were downgraded from baseline expectations. The prompt enforces differentiation between "accurate" and "precisely stated." |
| Temporal obsolescence | C007 | CONSORT 2010 (25 items) superseded by CONSORT 2025 (30 items). Claims about evolving standards need version qualification. |
| Publisher attribution | C005, C009 | Cochrane and NAS standards are attributed to organizational authors. The original institutional context matters for credibility assessment. |
| Novelty verification difficulty | C010 | Proving a negative ("no published combination exists") is inherently harder than confirming a positive. This claim has the lowest confidence in the collection. |
| Principles vs. methodology | C011 | The journalism claim oversimplifies a field that has both principles and methodology — but the article's point (that journalism lacks GRADE-equivalent formalization) holds. |
Collection Statistics
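The per-claim footers above (Confidence, Sources, Searches) and verdict ratings can be tallied mechanically; the tuples below are transcribed from the twelve records:

```python
from collections import Counter

# (claim, verdict rating, confidence, sources, searches), from the records above
CLAIMS = [
    ("C001", "Likely", "High", 2, 6), ("C002", "Almost certain", "High", 2, 4),
    ("C003", "Likely", "High", 2, 5), ("C004", "Almost certain", "High", 3, 5),
    ("C005", "Very likely", "High", 3, 4), ("C006", "Very likely", "High", 6, 10),
    ("C007", "Likely", "High", 4, 3), ("C008", "Almost certain", "High", 3, 2),
    ("C009", "Almost certain", "High", 3, 3), ("C010", "Likely", "Medium", 2, 4),
    ("C011", "Likely", "Medium", 5, 5), ("C012", "Very likely", "High", 5, 3),
]

ratings = Counter(c[1] for c in CLAIMS)     # Likely: 5, Very likely: 3, Almost certain: 4
confidence = Counter(c[2] for c in CLAIMS)  # High: 10, Medium: 2
total_sources = sum(c[3] for c in CLAIMS)   # 40
total_searches = sum(c[4] for c in CLAIMS)  # 54
```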
Source Independence Assessment
Across all 12 claims, sources were drawn from primary standards documents, official organizational websites, peer-reviewed publications, and secondary reference sources. Key observations:
- Primary documents dominate: Most claims could be verified against the original standard (ICD 203 text, GRADE handbook, IPCC guidance notes, etc.). This is a strength — the evidence base is not dependent on secondary interpretation.
- Wikipedia as cross-check, not primary: Wikipedia was used as a convenience cross-reference for several claims but never as a sole source. Where Wikipedia and primary sources diverged, primary sources prevailed.
- No circular sourcing detected: Sources citing each other were identified (e.g., the EQUATOR Network referencing CONSORT), but these reflect legitimate organizational relationships, not circular citation.
Collection Gaps
| Gap | Impact | Mitigation |
|---|---|---|
| No access to paywalled primary documents for some claims | Could not verify exact wording in original publications (e.g., Schulz et al. 2010 for CONSORT) | Used official summary documents and organizational websites as proxies |
| Single research run | No inter-run comparison for this specific decomposition | A/B comparison across 10 runs provides statistical context (see experimental data) |
| English-language sources only | Non-English scholarship on these frameworks not consulted | All frameworks examined are published primarily in English; impact is low |
| No practitioner interviews | Verification is document-based only; real-world application nuances not captured | Scope was intentionally limited to published standards verification |
Collection Self-Audit
| Domain | Rating | Notes |
|---|---|---|
| Eligibility criteria | Low risk | All claims derived from a single source article with clear boundaries |
| Search comprehensiveness | Some concerns | Paywalled sources limited depth for some claims; mitigated by multiple alternative sources |
| Evaluation consistency | Low risk | Same scorecard framework applied to all sources across all claims |
| Synthesis fairness | Low risk | Disconfirming evidence surfaced for claims C001, C003, C007, C010, C011; none suppressed |
Experimental Context
This research run was part of a controlled A/B experiment comparing baseline prompts against the full three-layer research standard. Key findings from the experimental comparison (10 runs, 31 of 32 agents completed):
- Process compliance: Behavioral constraints (self-audit, search logging, COI flagging) showed 0% compliance without enforcement language, 100% with it
- Calibration: Full-prompt agents downgraded 4 of 12 claims that baseline agents rated Almost certain
- Output depth: Full-prompt output averaged 1.5x the volume of baseline, driven by audit and methodology sections
- Core finding: Describing a process and constraining a behavior produce measurably different results
Full experimental analysis is maintained in the article research directory.