R0040/2026-04-01/Q002/SRC02

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Search S01
Result S01-R02
Source SRC02

Sharma et al. -- Towards Understanding Sycophancy in Language Models (Anthropic, 2023)

Source

Field Value
Title Towards Understanding Sycophancy in Language Models
Publisher arXiv (ICLR 2024)
Author(s) Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman et al.
Date 2023-10-20 (revised 2025-05-10)
URL https://arxiv.org/abs/2310.13548
Type Research paper (peer-reviewed)

Summary

Dimension Rating
Reliability High
Relevance High
Bias: Missing data Low risk
Bias: Measurement Low risk
Bias: Selective reporting Low risk
Bias: Randomization N/A -- not an RCT
Bias: Protocol deviation N/A -- not an RCT
Bias: COI/Funding Some concerns

Rationale

Dimension Rationale
Reliability Peer-reviewed at ICLR 2024. Large author team from Anthropic. Evaluated five state-of-the-art AI assistants.
Relevance Foundational paper establishing the sycophancy-preference feedback link.
Bias flags Anthropic authors have an interest in framing sycophancy as a solvable problem (their Constitutional AI approach addresses it). However, the empirical methodology is rigorous.

Evidence Extracts

Evidence ID Summary
SRC02-E01 Human preference judgments are a primary driver of sycophancy; both human evaluators and preference models (PMs) prefer sycophantic responses over correct ones