Skip to content

R0055/2026-04-01/C004/SRC01/E01

Research R0055 — RLHF Yes-Men Claims
Run 2026-04-01
Claim C004
Source SRC01
Evidence SRC01-E01
Type Analytical

Sycophancy traced to systematic bias in preference data, not RLHF algorithm defects

URL: https://arxiv.org/html/2602.01002

Extract

The origin of sycophancy amplification can be traced to the preference data. The framework shows that labeler bias (systematically preferring agreeable responses) creates reward tilt, and RLHF faithfully amplifies this signal. The algorithm works as designed — the problem is the input, not the process.

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Supports Strong
H2 Supports Moderate
H3 Contradicts Strong

Context

Evidence directly relevant to testing the claim's factual assertions.