Skip to content
Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q002 — RLHF and Sycophancy
Source SRC01
Evidence SRC01-E01

SRC01-E01 — RLHF Causes Sycophancy Through Preference Judgments

Extract

"Human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy." The research found that "sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses."

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Strongly supports — establishes RLHF as a causal driver of sycophancy Strong
H2 Contradicts — problem is documented, not unrecognized Strong
H3 Supports — indicates the problem is structural, not easily patched Strong

Context

This is the foundational paper establishing the RLHF-sycophancy causal link. Published at ICLR 2024 and widely cited.

Notes

None.