R0057/2026-04-01/C003/H1

Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C003
Hypothesis H1

Statement

The claim accurately captures the paper's attribution.

Status

Current: Supported

Supporting Evidence

Evidence Summary
SRC01-E01 Sycophancy is attributed to systematic bias in human annotator preferences, not to algorithmic failures in RLHF

Contradicting Evidence

Evidence Summary
No contradicting evidence found

Reasoning

The paper explicitly identifies mixed-pair bias as the mechanism, defining it as the average implied score difference in comparisons between agreement and correction responses. The RLHF algorithm itself works correctly; the objective it optimizes is already biased by annotator preferences.
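
The definition of mixed-pair bias above can be illustrated with a minimal sketch. It assumes each mixed pair has already been reduced to two implied scores, one for the agreement response and one for the correction response. The function name, the input layout, and the example numbers are hypothetical and are not taken from the paper, which may derive implied scores differently.

```python
from statistics import mean


def mixed_pair_bias(pairs):
    """Average implied score difference over mixed pairs.

    `pairs` is a sequence of (agreement_score, correction_score) tuples.
    This layout is an assumption made for illustration only.
    """
    # Positive differences mean the agreement response was scored higher.
    diffs = [agreement - correction for agreement, correction in pairs]
    if not diffs:
        raise ValueError("no mixed pairs supplied")
    return mean(diffs)


# Made-up scores, not data from the paper.
example_pairs = [(1.5, 0.5), (2.0, 1.0), (0.5, 1.0)]
bias = mixed_pair_bias(example_pairs)
print(f"mixed-pair bias: {bias:+.2f}")
```

Under this reading, a positive value means annotators on average preferred agreement over correction, so a reward model fitted to those labels would inherit that preference even if the optimization step has no defect.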

Relationship to Other Hypotheses

H1 represents full accuracy. H2 allows for partial correctness. H3 is eliminated by the evidence.