R0057/2026-04-01/C003/SRC01/E01¶
Sycophancy attributed to systematic bias in human annotator preferences, not algorithmic failures in RLHF
URL: https://arxiv.org/html/2602.01002
Extract¶
The paper explicitly identifies mixed-pair bias — the average implied score difference in comparisons between agreement and correction responses — as the mechanism. The algorithm works correctly; it optimizes a biased objective.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Directly addresses claim accuracy |
| H2 | Supports | Allows for partial correctness |
| H3 | Contradicts | Evidence contradicts material inaccuracy |
Context¶
The formal proofs trace sycophancy to annotator preferences, not to failures in the optimization algorithm itself.