
R0002/2026-03-13

Research R0002 — Research Standards for AI-Assisted Writing
Mode: Claim
Run date: 2026-03-13
Claims: 12
Prompt: research-standard-claim (full-prompt-run-07)
Model: Claude Opus 4.6

Twelve claims investigated, each with full entity decomposition: search logs, source scorecards, evidence extracts, ACH matrices, and self-audits.


Claims

C001 — ICD 203 tradecraft standards and probability scale — Likely

Claim: ICD 203 defines nine tradecraft standards (Sourcing, Uncertainty, Distinction, Alternatives, Relevance, Logic, Change, Accuracy, Visual integrity). Seven-point probability scale from Remote (01-05%) through Almost certain (95-99%).

Verdict: Structurally correct, label imprecisions

Hypothesis · Status · Probability
H1: Accurate in all details · Inconclusive
H2: Correct structure, inaccurate labels · Supported · Likely (55–80%)
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 2 · Searches: 6
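
The probability labels attached to verdicts throughout this report follow this seven-point band structure. A minimal lookup sketch in Python: the band edges are those cited in the claim, while the intermediate labels and the inclusive boundary handling are illustrative assumptions, not findings of this run.

```python
# Seven-point probability scale as used in this report's verdicts.
# Band edges follow the ranges cited in the claim; intermediate
# labels and boundary handling are illustrative assumptions.
BANDS = [
    (0.01, 0.05, "Remote"),
    (0.05, 0.20, "Very unlikely"),
    (0.20, 0.45, "Unlikely"),
    (0.45, 0.55, "Roughly even chance"),
    (0.55, 0.80, "Likely"),
    (0.80, 0.95, "Very likely"),
    (0.95, 0.99, "Almost certain"),
]

def band_label(p: float) -> str:
    """Map a numeric probability estimate to its verbal band.

    Shared edges (e.g. 0.05) resolve to the lower band: first match wins.
    """
    for lo, hi, label in BANDS:
        if lo <= p <= hi:
            return label
    raise ValueError(f"probability {p:.2f} is outside the 1-99% scale")

# e.g. band_label(0.70) -> "Likely"; band_label(0.97) -> "Almost certain"
```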


C002 — GRADE framework for rating evidence quality — Almost certain

Claim: GRADE's core insight is that evidence quality and recommendation strength are independent axes. Four certainty levels, five downgrade criteria, three upgrade criteria.

Verdict: Fully confirmed

Hypothesis · Status · Probability
H1: Accurate in all details · Supported · Almost certain (95–99%)
H2: Partially correct · Inconclusive
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 2 · Searches: 4
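
The counts in the claim can be made concrete with a small sketch. The specific level and criterion names below are standard GRADE terminology supplied from general knowledge; they are assumptions here, not extracted from this run's sources.

```python
# GRADE as described in the claim: evidence certainty and recommendation
# strength are independent axes. Names are standard GRADE terms
# (an assumption here, not drawn from this report's evidence extracts).
CERTAINTY_LEVELS = ["High", "Moderate", "Low", "Very low"]

DOWNGRADE_CRITERIA = [
    "risk of bias", "inconsistency", "indirectness",
    "imprecision", "publication bias",
]

UPGRADE_CRITERIA = [
    "large magnitude of effect", "dose-response gradient",
    "plausible residual confounding",
]

# The second, independent axis: a strong recommendation can rest on
# low-certainty evidence, and a conditional one on high-certainty evidence.
RECOMMENDATION_STRENGTHS = ["Strong", "Conditional (weak)"]

assert len(CERTAINTY_LEVELS) == 4
assert len(DOWNGRADE_CRITERIA) == 5
assert len(UPGRADE_CRITERIA) == 3
```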


C003 — IPCC calibrated uncertainty language — Likely

Claim: IPCC two-axis confidence model: Evidence quality (Limited, Medium, Robust) × Source agreement (Low, Medium, High), five confidence levels. Separate nine-point likelihood scale.

Verdict: Confirmed, count debatable

Hypothesis · Status · Probability
H1: Accurate in all details · Inconclusive
H2: Framework correct, count differs · Supported · Likely (55–80%)
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 2 · Searches: 5
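
The two-axis structure can be enumerated directly. The five confidence-level names below are the usual IPCC terms, supplied here as an assumption rather than from this run's extracts.

```python
from itertools import product

# IPCC confidence model as described in the claim: two qualitative axes.
EVIDENCE = ["Limited", "Medium", "Robust"]
AGREEMENT = ["Low", "Medium", "High"]

# Five confidence levels (names assumed, not from this run's sources).
CONFIDENCE_LEVELS = ["Very low", "Low", "Medium", "High", "Very high"]

# The axes give nine evidence/agreement combinations; IPCC authors map
# these onto the five levels by expert judgment rather than a fixed rule.
combinations = list(product(EVIDENCE, AGREEMENT))
assert len(combinations) == 9
assert len(CONFIDENCE_LEVELS) == 5
```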


C004 — PRISMA checklist and Mulrow 1987 — Almost certain

Claim: PRISMA exists because systematic reviews had abysmal reporting quality. Mulrow 1987 documented most reviews failed basic criteria.

Verdict: Fully confirmed

Hypothesis · Status · Probability
H1: Accurate in all details · Supported · Almost certain (95–99%)
H2: Partially correct · Inconclusive
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 3 · Searches: 5


C005 — Cochrane RoB 2 five domains of bias — Very likely

Claim: Cochrane's RoB 2 has five bias domains. COI/funding conspicuously absent.

Verdict: Confirmed, COI note

Hypothesis · Status · Probability
H1: Five domains, COI absent · Supported · Very likely (80–95%)
H2: Five domains, COI addressed elsewhere · Inconclusive
H3: Incorrect domain count · Eliminated

Confidence: High · Sources: 3 · Searches: 4


C006 — Chamberlin multiple hypotheses / Platt strong inference — Very likely

Claim: Chamberlin 1890/1897; Platt 1964 citing Chamberlin; the "parental affection" metaphor, step "1'", and a Baconian method reference.

Verdict: Confirmed, Baconian attribution nuance

Hypothesis · Status · Probability
H1: All elements accurate · Supported · Very likely (80–95%)
H2: Elements confirmed, attribution nuanced · Inconclusive
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 6 · Searches: 10


C007 — CONSORT 25-item checklist — Likely

Claim: CONSORT is a 25-item checklist for reporting randomized controlled trials.

Verdict: Accurate for 2010, outdated by 2025

Hypothesis · Status · Probability
H1: Accurate as stated · Inconclusive
H2: Correct for 2010, superseded · Supported · Likely (70%)
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 4 · Searches: 3


C008 — ROBIS four domains of bias — Almost certain

Claim: ROBIS assesses four domains of bias in systematic reviews.

Verdict: Confirmed

Hypothesis · Status · Probability
H1: Four domains confirmed · Supported · Almost certain (95–99%)
H2: Partially correct · Inconclusive
H3: Incorrect domain count · Eliminated

Confidence: High · Sources: 3 · Searches: 2


C009 — NAS 21 standards with 82 elements — Almost certain

Claim: NAS published 21 standards with 82 elements across four stages for systematic reviews.

Verdict: Confirmed, publisher attribution note

Hypothesis · Status · Probability
H1: Accurate in all details · Supported · Almost certain (95–99%)
H2: Correct counts, attribution nuance · Inconclusive
H3: Incorrect counts · Eliminated

Confidence: High · Sources: 3 · Searches: 3


C010 — No published systematic combination — Likely

Claim: No one has published a systematic combination of IC analytical frameworks with scientific research methodology frameworks into a unified, machine-executable prompt.

Verdict: Not refuted, incompletely verified

Hypothesis · Status · Probability
H1: No such combination exists · Supported · Likely (55–80%)
H2: Partial combinations exist · Inconclusive
H3: Full combination exists · Eliminated

Confidence: Medium · Sources: 2 · Searches: 4


C011 — Journalism principles-based, not methodology-based — Likely

Claim: Journalism and fact-checking are principles-based, not methodology-based. They lack formal evidence hierarchies, calibrated uncertainty scales, and structured bias assessment domains.

Verdict: Partially confirmed, oversimplified

Hypothesis · Status · Probability
H1: Purely principles-based · Inconclusive
H2: Principles-based with emerging methodology · Supported · Likely (65–75%)
H3: Fully methodology-based · Eliminated

Confidence: Medium · Sources: 5 · Searches: 5


C012 — Wardle/Derakhshan information disorder taxonomy — Very likely

Claim: Wardle and Derakhshan published an information disorder taxonomy in 2017 through the Council of Europe, distinguishing misinformation, disinformation, and malinformation based on intent to harm.

Verdict: Confirmed, two-dimensional clarification needed

Hypothesis · Status · Probability
H1: Accurate as stated · Supported · Very likely (80–85%)
H2: Correct framework, simplification · Inconclusive
H3: Material misrepresentation · Eliminated

Confidence: High · Sources: 5 · Searches: 3



Collection Analysis

The sections below are the product of analyzing the full collection of claims after individual claim investigation was complete. These findings are properties of the research run as a whole, not of any single claim.

Cross-Cutting Patterns

  • Precision downgrades (C001, C003, C007, C012): Claims that are structurally correct but imprecise in count or labeling were downgraded from baseline expectations. The prompt enforces the distinction between "accurate" and "precisely stated."
  • Temporal obsolescence (C007): CONSORT 2010 (25 items) was superseded by CONSORT 2025 (30 items). Claims about evolving standards need version qualification.
  • Publisher attribution (C005, C009): Cochrane and NAS standards are attributed to organizational authors. The original institutional context matters for credibility assessment.
  • Novelty verification difficulty (C010): Proving a negative ("no published combination exists") is inherently harder than confirming a positive. This claim has the lowest confidence in the collection.
  • Principles vs. methodology (C011): The journalism claim oversimplifies a field that has both principles and methodology, but the article's point (that journalism lacks GRADE-equivalent formalization) holds.

Collection Statistics

Claims investigated: 12
Fully confirmed (Almost certain): 4 (C002, C004, C008, C009)
Confirmed with nuance (Very likely): 3 (C005, C006, C012)
Confirmed with caveats (Likely): 4 (C001, C003, C007, C011)
Incompletely verified (Likely, Medium confidence): 1 (C010)
High confidence assessments: 10 of 12
Medium confidence assessments: 2 of 12 (C010, C011)
Corrections recommended to source article: 5 (C001, C003, C007, C011, C012)
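
The tallies above can be cross-checked in a few lines; the groupings are transcribed directly from the statistics.

```python
# Verdict groupings transcribed from the collection statistics above.
verdict_groups = {
    "Almost certain": ["C002", "C004", "C008", "C009"],
    "Very likely": ["C005", "C006", "C012"],
    "Likely": ["C001", "C003", "C007", "C011"],
    "Likely (Medium confidence)": ["C010"],
}
medium_confidence = ["C010", "C011"]

total = sum(len(claims) for claims in verdict_groups.values())
assert total == 12                           # claims investigated
assert total - len(medium_confidence) == 10  # high confidence assessments
```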

Source Independence Assessment

Across all 12 claims, sources were drawn from primary standards documents, official organizational websites, peer-reviewed publications, and secondary reference sources. Key observations:

  • Primary documents dominate: Most claims could be verified against the original standard (ICD 203 text, GRADE handbook, IPCC guidance notes, etc.). This is a strength — the evidence base is not dependent on secondary interpretation.
  • Wikipedia as cross-check, not primary: Wikipedia was used as a convenience cross-reference for several claims but never as a sole source. Where Wikipedia and primary sources diverged, primary sources prevailed.
  • No circular sourcing detected: Sources citing each other were identified (e.g., EQUATOR Network referencing CONSORT) but these represent legitimate organizational relationships, not circular citation.

Collection Gaps

  • No access to paywalled primary documents for some claims. Impact: could not verify exact wording in original publications (e.g., Schulz et al. 2010 for CONSORT). Mitigation: used official summary documents and organizational websites as proxies.
  • Single research run. Impact: no inter-run comparison for this specific decomposition. Mitigation: A/B comparison across 10 runs provides statistical context (see experimental data).
  • English-language sources only. Impact: non-English scholarship on these frameworks not consulted. Mitigation: all frameworks examined are published primarily in English; impact is low.
  • No practitioner interviews. Impact: verification is document-based only; real-world application nuances not captured. Mitigation: scope was intentionally limited to published standards verification.

Collection Self-Audit

  • Eligibility criteria · Low risk: All claims derived from a single source article with clear boundaries.
  • Search comprehensiveness · Some concerns: Paywalled sources limited depth for some claims; mitigated by multiple alternative sources.
  • Evaluation consistency · Low risk: Same scorecard framework applied to all sources across all claims.
  • Synthesis fairness · Low risk: Disconfirming evidence surfaced for claims C001, C003, C007, C010, C011; none suppressed.

Experimental Context

This research run was part of a controlled A/B experiment comparing baseline prompts against the full three-layer research standard. Key findings from the experimental comparison (10 runs, 31 of 32 agents completed):

  • Process compliance: Behavioral constraints (self-audit, search logging, COI flagging) showed 0% compliance without enforcement language, 100% with it
  • Calibration: Full-prompt agents downgraded 4 of 12 claims that baseline agents rated Almost certain
  • Output depth: Full-prompt output averaged 1.5x the volume of baseline, driven by audit and methodology sections
  • Core finding: Describing a process and constraining a behavior produce measurably different results

Full experimental analysis is maintained in the article research directory.