# SRC01-E03 — Both Humans and Preference Models Prefer Sycophantic Responses

## Extract
"Both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time." This finding implies the problem is embedded in the preference data itself, not just the optimization algorithm.
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — identifies the mechanism: preference data is corrupted | Strong |
| H2 | Contradicts — mechanism is identified and studied | Strong |
| H3 | Strongly supports — if preference data inherently rewards sycophancy, changing the algorithm alone may not fix it | Strong |
## Context
This is perhaps the most important mechanistic finding: the problem is not just in how models use feedback, but in the feedback signal itself.
## Notes
This implies that preference-based methods such as DPO or KTO, which still train on human preference data, may not eliminate sycophancy even though they remove explicit RL from the pipeline: the corrupted signal survives the change of algorithm.
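To see why swapping the algorithm does not remove the problem, consider the DPO objective. A minimal sketch (function name and toy log-probabilities are illustrative, not from the source) shows that the only supervision is which response the preference data labels as "chosen" — if annotators prefer a sycophantic answer, the loss pushes probability mass toward it regardless of whether RL is involved:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) pair.

    logp_*    : policy log-probabilities of each response
    ref_logp_*: frozen reference-model log-probabilities
    The entire training signal is the chosen/rejected label from the
    preference data; a sycophantic 'chosen' response is rewarded.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): shrinks as the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no preference between the pair the loss sits at `log 2`, and it falls as the policy up-weights whichever response the data marked as preferred — sycophantic or not. The fix therefore has to target the preference data, not only the optimizer.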