# SRC01-E03 — Both Humans and Preference Models Prefer Sycophantic Responses

## Extract
"Both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time." This finding implies the problem is embedded in the preference data itself, not just the optimization algorithm.
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — identifies the mechanism: preference data is corrupted | Strong |
| H2 | Contradicts — mechanism is identified and studied | Strong |
| H3 | Strongly supports — if preference data inherently rewards sycophancy, changing the algorithm alone may not fix it | Strong |
## Context
This is perhaps the most important mechanistic finding: the problem is not just in how models use feedback, but in the feedback signal itself.
## Notes
This implies that preference-based methods such as DPO or KTO, which still train on human preference data, may not eliminate sycophancy even though they remove explicit RL from the pipeline: the corrupted signal survives the change of algorithm.
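To see why swapping the algorithm does not remove the problem, consider the DPO objective. A minimal sketch (function name and toy log-probabilities are illustrative, not from the source) shows that the only supervision is which response the preference data labels as "chosen" — if annotators prefer a sycophantic answer, the loss pushes probability mass toward it regardless of whether RL is involved:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) pair.

    logp_*    : policy log-probabilities of each response
    ref_logp_*: frozen reference-model log-probabilities
    The entire training signal is the chosen/rejected label from the
    preference data; a sycophantic 'chosen' response is rewarded.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): shrinks as the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no preference between the pair the loss sits at `log 2`, and it falls as the policy up-weights whichever response the data marked as preferred — sycophantic or not. The fix therefore has to target the preference data, not only the optimizer.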