R0055/2026-04-01/C002

Research R0055 — RLHF Yes-Men Claims
Run 2026-04-01
Claim C002

Claim: AI models are trained using Reinforcement Learning from Human Feedback (RLHF), where human labelers evaluate model outputs and express preferences

BLUF: This is an established fact. In RLHF, human labelers rank or compare model outputs, and those preferences are used to train a reward model that guides optimization of the model. The technique has been extensively documented since 2017.

Probability: Almost certain (95-99%) | Confidence: High
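
As an illustration of the mechanism the BLUF describes, the following is a minimal sketch of how labeler preferences can train a reward model. It assumes a Bradley-Terry style pairwise loss and a linear reward function over hand-made feature vectors. The function names, the two-feature toy data, and the hyperparameters are all hypothetical and are not taken from the source study, and production systems use neural reward models, not a linear one.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

def reward(w, features):
    """Linear reward: dot product of weights and output features."""
    return sum(wi * x for wi, x in zip(w, features))

def train_linear_reward(pairs, dim, lr=0.1, epochs=200):
    """Fit weights so each labeler-preferred output scores higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            # Derivative of the loss with respect to the margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            w = [wi - lr * grad_scale * d for wi, d in zip(w, diff)]
    return w

# Hypothetical toy data: (preferred output, rejected output) feature pairs.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.9])]
w = train_linear_reward(pairs, dim=2)
```

After fitting, `reward(w, chosen)` should exceed `reward(w, rejected)` for each labeled pair. In a full RLHF pipeline, a reward model of this kind then supplies the signal for the optimization step.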


Summary

| Entity | Description |
|---|---|
| Claim Definition | Claim text, scope, status |
| Assessment | Full analytical product with reasoning chain |
| ACH Matrix | Evidence x hypotheses diagnosticity analysis |
| Self-Audit | ROBIS-adapted 5-domain audit |

Hypotheses

| ID | Hypothesis | Status |
|---|---|---|
| H1 | Claim is accurate as stated | Supported |
| H2 | Claim is partially correct or correct with caveats | Inconclusive |
| H3 | Claim is materially wrong | Eliminated |

Searches

| ID | Target | Results | Selected |
|---|---|---|---|
| S01 | RLHF training methodology human labelers preferenc | 10 | 2 |

Sources

| Source | Description | Reliability | Relevance |
|---|---|---|---|
| SRC01 | Anthropic/ICLR RLHF study | High | High |

Revisit Triggers

  • Fundamental change in how RLHF is described in academic literature