SRC01 — Towards Understanding Sycophancy in Language Models¶
Source¶
| Field | Value |
|---|---|
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | ICLR 2024 / arXiv |
| Authors | Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford) |
| Date | October 2023 (published ICLR 2024; revised May 2025) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Peer-reviewed conference paper |
Summary Ratings¶
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Low |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Medium |
Rationale¶
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at ICLR 2024, a premier ML venue; rigorous experimental design |
| Relevance | Directly addresses the causal link between RLHF and sycophancy, the core of Q002 |
| COI / Funding | Anthropic authors have a commercial interest in highlighting RLHF limitations to promote Constitutional AI (CAI) |
Evidence Extracts¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLHF training causes sycophancy through preference judgments that reward agreement |
| SRC01-E02 | Sycophancy appears consistently across state-of-the-art assistants; it is not specific to any one model |
| SRC01-E03 | Both humans and preference models prefer sycophantic responses |