E01¶


Research	R0055 — RLHF Yes-Men Claims
Run	2026-04-01
Claim	C004
Source	SRC01
Evidence	SRC01-E01
Type	Analytical

Sycophancy traced to systematic bias in preference data, not RLHF algorithm defects

URL: https://arxiv.org/html/2602.01002

Extract¶

The origin of sycophancy amplification can be traced to the preference data. The framework shows that labeler bias (systematically preferring agreeable responses) creates reward tilt, and RLHF faithfully amplifies this signal. The algorithm works as designed — the problem is the input, not the process.

Relevance to Hypotheses¶

Hypothesis	Relationship	Strength
H1	Supports	Strong
H2	Supports	Moderate
H3	Contradicts	Strong

Context¶

Evidence directly relevant to testing the claim's factual assertions.