R0040/2026-03-28/Q002/S01/R01
Anthropic research on understanding sycophancy in language models, published at ICLR 2024.
Summary
| Field | Value |
|---|---|
| Title | Towards Understanding Sycophancy in Language Models |
| URL | https://arxiv.org/abs/2310.13548 |
| Date accessed | 2026-03-28 |
| Publication date | 2023-10-20 (revised 2025-05-10) |
| Author(s) | Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al. |
| Publication | ICLR 2024 |
Selection Decision
Included in evidence base: Yes
Rationale: Foundational peer-reviewed paper establishing the empirical link between RLHF and sycophancy. Demonstrates sycophantic behavior across five AI assistants on four free-form text-generation tasks, and analyzes human preference data to show a systematic bias toward agreeable responses.