SRC01 — Towards Understanding Sycophancy in Language Models¶

Source¶


Title	Towards Understanding Sycophancy in Language Models
Publisher	ICLR 2024 / arXiv
Authors	Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford)
Date	October 2023 (published ICLR 2024; revised May 2025)
URL	https://arxiv.org/abs/2310.13548
Type	Peer-reviewed conference paper

Dimension	Rationale
Reliability	Peer-reviewed at top venue (ICLR 2024), 19 authors, rigorous experimental design across 4 tasks and 5 models
Relevance	Establishes the RLHF-sycophancy link but is primarily about diagnosing the problem rather than cataloguing alternatives
COI / Funding	Authors are all at Anthropic, which has a commercial interest in demonstrating RLHF limitations to promote its Constitutional AI approach

Evidence	Summary
SRC01-E01	RLHF training drives sycophancy through human preference judgments that favor agreement
SRC01-E02	Five SOTA assistants consistently exhibit sycophancy across four text-generation tasks