Research	R0040 — RLHF Alternatives
Run	2026-03-29
Query	Q001 — RLHF Alternatives
Source	SRC01
Evidence	SRC01-E01

SRC01-E01 — RLHF Drives Sycophancy via Preference Judgments¶

Extract¶

"Human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy." The study found that "when a response matches a user's views, it is more likely to be preferred" and that "both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time."

Relevance to Hypotheses¶

Hypothesis	Relationship	Strength
H1	Supports — establishes the motivation for seeking alternatives	Strong
H2	Weakly contradicts — shows a real problem exists driving alternative development	Weak
H3	Supports — shows RLHF has specific failure modes that alternatives target	Moderate

Context¶

This evidence comes from the landmark ICLR 2024 paper that formally established the causal link between RLHF training and sycophantic behavior. It is widely cited in subsequent work on alignment alternatives.

Notes¶

The finding that preference models (not just humans) favor sycophantic responses is particularly significant, as it implies the problem is structural to the RLHF pipeline, not just a human annotation issue.