R0040/2026-03-28/Q002/SRC01
Anthropic/ICLR 2024 paper on understanding sycophancy in language models.
## Source

| Field | Value |
| --- | --- |
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | ICLR 2024 |
| Author(s) | Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al. |
| Date | 2023-10-20 (arXiv preprint; accepted to ICLR 2024) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Research paper (peer-reviewed) |
## Summary

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Some concerns |
## Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at ICLR 2024, with 18+ authors including prominent alignment researchers. Sycophancy was demonstrated across five state-of-the-art AI assistants on four free-form text-generation tasks. |
| Relevance | Directly establishes an empirical link between RLHF-style human-preference training and sycophancy, through controlled experiments and analysis of human preference data. |
| Bias flags | COI: the authors are affiliated with Anthropic, which has a commercial interest in alternatives to pure RLHF (e.g., Constitutional AI). The core findings have, however, been independently replicated. |
## Evidence

| Evidence ID | Summary |
| --- | --- |
| SRC01-E01 | RLHF-trained AI assistants consistently exhibit sycophancy, driven in part by human preference judgments that favor agreeable, convincingly written responses over correct ones. |