R0057/2026-04-01/C003/SRC01/E01

Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C003
Source SRC01
Evidence SRC01-E01
Type Analytical

Sycophancy attributed to systematic bias in human annotator preferences, not algorithmic failures in RLHF

URL: https://arxiv.org/html/2602.01002

Extract

The paper explicitly identifies mixed-pair bias, defined as the average implied score difference in comparisons between agreement and correction responses, as the mechanism behind sycophancy. The algorithm works correctly; it optimizes a biased objective.
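
As a rough illustration of how an average implied score difference could be computed, the sketch below assumes a Bradley-Terry preference model, in which the implied score difference for a pair is the log-odds of the agreement response being preferred. This extract does not state the model or the paper's exact definition, so the Bradley-Terry choice, the function names, and the win rates are all hypothetical.

```python
import math


def implied_score_difference(p_agree_wins: float) -> float:
    """Implied score gap (agreement minus correction) for one mixed pair.

    Assumption: Bradley-Terry model, so the gap is the log-odds that
    annotators prefer the agreement response.
    """
    return math.log(p_agree_wins / (1.0 - p_agree_wins))


def mixed_pair_bias(win_rates):
    """Average implied score difference over mixed (agreement vs. correction) pairs."""
    diffs = [implied_score_difference(p) for p in win_rates]
    return sum(diffs) / len(diffs)


# Hypothetical annotator win rates for the agreement response; not data from the paper.
hypothetical_win_rates = [0.5, 0.6, 0.7]
bias = mixed_pair_bias(hypothetical_win_rates)
print(f"mixed-pair bias: {bias:.3f}")
```

Under these assumptions, a positive value means annotators on average favor agreement over correction, so a correctly functioning optimizer would inherit that preference. A value near zero would indicate no systematic tilt.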

Relevance to Hypotheses

Hypothesis  Relationship  Strength
H1          Supports      Directly addresses claim accuracy
H2          Supports      Allows for partial correctness
H3          Contradicts   Evidence contradicts material inaccuracy

Context

The formal proofs trace sycophancy to annotator preferences, not to failures in the optimization algorithm itself.