R0002/2026-03-13
12 claims investigated. Full entity decomposition with search logs, source scorecards, evidence extracts, ACH matrices, and self-audits.
Claims
C001 — ICD 203 tradecraft standards and probability scale — Likely
Claim: ICD 203 defines nine tradecraft standards (Sourcing, Uncertainty, Distinction, Alternatives, Relevance, Logic, Change, Accuracy, Visual integrity). Seven-point probability scale from Remote (01–05%) through Almost certain (95–99%).
Verdict: Structurally correct, label imprecisions
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Inconclusive | — |
| H2: Correct structure, inaccurate labels | Supported | Likely (55–80%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 6
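The "label imprecisions" in the verdict are relative to the official band wording; for reference, the seven ICD 203 bands and their percentage edges can be written as a lookup (labels and edges below are the published ICD 203 ones; boundary percentages are shared between adjacent bands, so this sketch returns the first match):

```python
# ICD 203 seven-band probability scale. The claim's "Remote" corresponds
# to ICD 203's "almost no chance / remote" band.
ICD203_BANDS = [
    (1, 5, "almost no chance / remote"),
    (5, 20, "very unlikely / highly improbable"),
    (20, 45, "unlikely / improbable"),
    (45, 55, "roughly even chance / roughly even odds"),
    (55, 80, "likely / probable"),
    (80, 95, "very likely / highly probable"),
    (95, 99, "almost certain / nearly certain"),
]

def icd203_label(percent: float) -> str:
    """Map a probability (in percent) to its ICD 203 verbal band.

    Band edges overlap (5, 20, 45, ...); a shared boundary resolves
    to the lower band because the first match wins.
    """
    for low, high, label in ICD203_BANDS:
        if low <= percent <= high:
            return label
    raise ValueError(f"{percent}% is outside the 1-99% scale")
```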
C002 — GRADE framework for rating evidence quality — Almost certain
Claim: GRADE's core insight is that evidence quality and recommendation strength are independent axes. Four certainty levels, five downgrade criteria, three upgrade criteria.
Verdict: Fully confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 4
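The claim's counts map onto GRADE's additive structure: certainty starts at a level set by study design, the five downgrade criteria move it down, the three upgrade criteria move it up, and the result is clamped to the four-level scale. A simplified sketch (real GRADE judgments involve discretion and are not purely additive; this function only illustrates the structure):

```python
LEVELS = ["Very low", "Low", "Moderate", "High"]  # GRADE's four certainty levels

DOWNGRADE = {"risk of bias", "inconsistency", "indirectness",
             "imprecision", "publication bias"}          # five downgrade criteria
UPGRADE = {"large effect", "dose-response gradient",
           "plausible residual confounding"}             # three upgrade criteria

def grade_certainty(randomized: bool, downgrades: dict, upgrades: dict) -> str:
    """Simplified GRADE tally. The dicts map criterion -> levels moved (1 or 2)."""
    assert set(downgrades) <= DOWNGRADE and set(upgrades) <= UPGRADE
    start = 3 if randomized else 1   # RCTs start High, observational evidence starts Low
    score = start - sum(downgrades.values()) + sum(upgrades.values())
    return LEVELS[max(0, min(3, score))]
```

For example, a randomized body of evidence downgraded one level for imprecision lands at Moderate: `grade_certainty(True, {"imprecision": 1}, {})`.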
C003 — IPCC calibrated uncertainty language — Likely
Claim: IPCC two-axis confidence model: Evidence quality (Limited, Medium, Robust) × Source agreement (Low, Medium, High), five confidence levels. Separate nine-point likelihood scale.
Verdict: Confirmed, count debatable
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Inconclusive | — |
| H2: Framework correct, count differs | Supported | Likely (55–80%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 2 · Searches: 5
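The two axes form a 3×3 grid that authors map onto five confidence levels. The IPCC guidance presents that mapping as a judgment aid, not a formula; the summed-index heuristic below is this sketch's own assumption, shown only to make the grid concrete:

```python
EVIDENCE = ["limited", "medium", "robust"]
AGREEMENT = ["low", "medium", "high"]
CONFIDENCE = ["very low", "low", "medium", "high", "very high"]  # five levels

def confidence_heuristic(evidence: str, agreement: str) -> str:
    """Illustrative only: sum the two axis indices (0-2 each) to pick one
    of the five confidence levels. IPCC guidance leaves the exact
    cell-to-level assignment to author judgment."""
    i = EVIDENCE.index(evidence) + AGREEMENT.index(agreement)  # 0..4
    return CONFIDENCE[i]
```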
C004 — PRISMA checklist and Mulrow 1987 — Almost certain
Claim: PRISMA exists because systematic reviews had abysmal reporting quality. Mulrow 1987 documented most reviews failed basic criteria.
Verdict: Fully confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 5
C005 — Cochrane RoB 2 five domains of bias — Very likely
Claim: Cochrane's RoB 2 has five bias domains. COI/funding conspicuously absent.
Verdict: Confirmed, COI note
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Five domains, COI absent | Supported | Very likely (80–95%) |
| H2: Five domains, COI addressed elsewhere | Inconclusive | — |
| H3: Incorrect domain count | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 4
C006 — Chamberlin multiple hypotheses / Platt strong inference — Very likely
Claim: Chamberlin published the method of multiple working hypotheses in 1890/1897; Platt's 1964 "strong inference" paper cites Chamberlin. Includes the "parental affection" metaphor, the recurring step "1'", and a Baconian method reference.
Verdict: Confirmed, Baconian attribution nuance
| Hypothesis | Status | Probability |
|---|---|---|
| H1: All elements accurate | Supported | Very likely (80–95%) |
| H2: Elements confirmed, attribution nuanced | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 6 · Searches: 10
C007 — CONSORT 25-item checklist — Likely
Claim: CONSORT is a 25-item checklist for reporting randomized controlled trials.
Verdict: Accurate for 2010, outdated by 2025
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate as stated | Inconclusive | — |
| H2: Correct for 2010, superseded | Supported | Likely (70%) |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 4 · Searches: 3
C008 — ROBIS four domains of bias — Almost certain
Claim: ROBIS assesses four domains of bias in systematic reviews.
Verdict: Confirmed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Four domains confirmed | Supported | Almost certain (95–99%) |
| H2: Partially correct | Inconclusive | — |
| H3: Incorrect domain count | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 2
C009 — NAS 21 standards with 82 elements — Almost certain
Claim: NAS published 21 standards with 82 elements across four stages for systematic reviews.
Verdict: Confirmed, publisher attribution note
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate in all details | Supported | Almost certain (95–99%) |
| H2: Correct counts, attribution nuance | Inconclusive | — |
| H3: Incorrect counts | Eliminated | — |
Confidence: High · Sources: 3 · Searches: 3
C010 — No published systematic combination — Likely
Claim: No one has published a systematic combination of IC analytical frameworks with scientific research methodology frameworks into a unified, machine-executable prompt.
Verdict: Not refuted, incompletely verified
| Hypothesis | Status | Probability |
|---|---|---|
| H1: No such combination exists | Supported | Likely (55–80%) |
| H2: Partial combinations exist | Inconclusive | — |
| H3: Full combination exists | Eliminated | — |
Confidence: Medium · Sources: 2 · Searches: 4
C011 — Journalism principles-based, not methodology-based — Likely
Claim: Journalism and fact-checking are principles-based, not methodology-based. They lack formal evidence hierarchies, calibrated uncertainty scales, and structured bias assessment domains.
Verdict: Partially confirmed, oversimplified
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Purely principles-based | Inconclusive | — |
| H2: Principles-based with emerging methodology | Supported | Likely (65–75%) |
| H3: Fully methodology-based | Eliminated | — |
Confidence: Medium · Sources: 5 · Searches: 5
C012 — Wardle/Derakhshan information disorder taxonomy — Very likely
Claim: Wardle and Derakhshan published an information disorder taxonomy in 2017 through the Council of Europe, distinguishing misinformation, disinformation, and malinformation based on intent to harm.
Verdict: Confirmed, two-dimensional clarification needed
| Hypothesis | Status | Probability |
|---|---|---|
| H1: Accurate as stated | Supported | Very likely (80–85%) |
| H2: Correct framework, simplification | Inconclusive | — |
| H3: Material misrepresentation | Eliminated | — |
Confidence: High · Sources: 5 · Searches: 3
Collection Analysis
The sections below analyze the collection of claims as a whole, after all individual investigations were complete. These findings are properties of the research run, not of any single claim.
Cross-Cutting Patterns
| Pattern | Claims Affected | Significance |
|---|---|---|
| Precision downgrades | C001, C003, C007, C012 | Claims that are structurally correct but imprecise in count or labeling were downgraded from baseline expectations. The prompt enforces differentiation between "accurate" and "precisely stated." |
| Temporal obsolescence | C007 | CONSORT 2010 (25 items) superseded by CONSORT 2025 (30 items). Claims about evolving standards need version qualification. |
| Publisher attribution | C005, C009 | Cochrane and NAS standards are attributed to organizational authors. The original institutional context matters for credibility assessment. |
| Novelty verification difficulty | C010 | Proving a negative ("no published combination exists") is inherently harder than confirming a positive. This claim has the lowest confidence in the collection. |
| Principles vs. methodology | C011 | The journalism claim oversimplifies a field that has both principles and methodology — but the article's point (that journalism lacks GRADE-equivalent formalization) holds. |
Collection Statistics
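The per-claim footers above (Confidence, Sources, Searches) and verdict ratings can be tallied mechanically; the tuples below are transcribed from the twelve records:

```python
from collections import Counter

# (claim, verdict rating, confidence, sources, searches), from the records above
CLAIMS = [
    ("C001", "Likely", "High", 2, 6), ("C002", "Almost certain", "High", 2, 4),
    ("C003", "Likely", "High", 2, 5), ("C004", "Almost certain", "High", 3, 5),
    ("C005", "Very likely", "High", 3, 4), ("C006", "Very likely", "High", 6, 10),
    ("C007", "Likely", "High", 4, 3), ("C008", "Almost certain", "High", 3, 2),
    ("C009", "Almost certain", "High", 3, 3), ("C010", "Likely", "Medium", 2, 4),
    ("C011", "Likely", "Medium", 5, 5), ("C012", "Very likely", "High", 5, 3),
]

ratings = Counter(c[1] for c in CLAIMS)     # Likely: 5, Very likely: 3, Almost certain: 4
confidence = Counter(c[2] for c in CLAIMS)  # High: 10, Medium: 2
total_sources = sum(c[3] for c in CLAIMS)   # 40
total_searches = sum(c[4] for c in CLAIMS)  # 54
```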
Source Independence Assessment
Across all 12 claims, sources were drawn from primary standards documents, official organizational websites, peer-reviewed publications, and secondary reference sources. Key observations:
- Primary documents dominate: Most claims could be verified against the original standard (ICD 203 text, GRADE handbook, IPCC guidance notes, etc.). This is a strength — the evidence base is not dependent on secondary interpretation.
- Wikipedia as cross-check, not primary: Wikipedia was used as a convenience cross-reference for several claims but never as a sole source. Where Wikipedia and primary sources diverged, primary sources prevailed.
- No circular sourcing detected: Sources citing each other were identified (e.g., the EQUATOR Network referencing CONSORT), but these reflect legitimate organizational relationships, not circular citation.
Collection Gaps
| Gap | Impact | Mitigation |
|---|---|---|
| No access to paywalled primary documents for some claims | Could not verify exact wording in original publications (e.g., Schulz et al. 2010 for CONSORT) | Used official summary documents and organizational websites as proxies |
| Single research run | No inter-run comparison for this specific decomposition | A/B comparison across 10 runs provides statistical context (see experimental data) |
| English-language sources only | Non-English scholarship on these frameworks not consulted | All frameworks examined are published primarily in English; impact is low |
| No practitioner interviews | Verification is document-based only; real-world application nuances not captured | Scope was intentionally limited to published standards verification |
Collection Self-Audit
| Domain | Rating | Notes |
|---|---|---|
| Eligibility criteria | Low risk | All claims derived from a single source article with clear boundaries |
| Search comprehensiveness | Some concerns | Paywalled sources limited depth for some claims; mitigated by multiple alternative sources |
| Evaluation consistency | Low risk | Same scorecard framework applied to all sources across all claims |
| Synthesis fairness | Low risk | Disconfirming evidence surfaced for claims C001, C003, C007, C010, C011; none suppressed |
Experimental Context
This research run was part of a controlled A/B experiment comparing baseline prompts against the full three-layer research standard. Key findings from the experimental comparison (10 runs, 31 of 32 agents completed):
- Process compliance: Behavioral constraints (self-audit, search logging, COI flagging) showed 0% compliance without enforcement language, 100% with it
- Calibration: Full-prompt agents downgraded 4 of 12 claims that baseline agents rated Almost certain
- Output depth: Full-prompt output averaged 1.5x the volume of baseline, driven by audit and methodology sections
- Core finding: Describing a process and constraining a behavior produce measurably different results
Full experimental analysis is maintained in the article research directory.