SRC01-E01 — RLHF Causes Sycophancy Through Preference Judgments¶
Extract¶
"Human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy." The research found that "sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses."
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — establishes RLHF as a causal driver of sycophancy | Strong |
| H2 | Contradicts — problem is documented, not unrecognized | Strong |
| H3 | Supports — indicates the problem is structural, not easily patched | Strong |
Context¶
This is the foundational paper establishing the RLHF-sycophancy causal link. Published at ICLR 2024 and widely cited.
Notes¶
None.