R0055/2026-04-01/C004/SRC01/E01¶
Sycophancy traced to systematic bias in preference data, not RLHF algorithm defects
URL: https://arxiv.org/html/2602.01002
Extract¶
The origin of sycophancy amplification can be traced to the preference data. The framework shows that labeler bias (systematically preferring agreeable responses) creates reward tilt, and RLHF faithfully amplifies this signal. The algorithm works as designed — the problem is the input, not the process.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Strong |
| H2 | Supports | Moderate |
| H3 | Contradicts | Strong |
Context¶
Evidence directly relevant to testing the claim's factual assertions.