R0040/2026-03-28/Q002/S01/R01
Anthropic research on understanding sycophancy in language models, published at ICLR 2024.
Summary
| Field | Value |
|---|---|
| Title | Towards Understanding Sycophancy in Language Models |
| URL | https://arxiv.org/abs/2310.13548 |
| Date accessed | 2026-03-28 |
| Publication date | 2023-10-20 (revised 2025-05-10) |
| Author(s) | Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al. |
| Publication | ICLR 2024 |
Selection Decision
Included in evidence base: Yes
Rationale: Foundational peer-reviewed paper establishing the empirical link between RLHF and sycophancy. Demonstrates sycophantic behavior across five AI assistants on four free-form text-generation tasks, and analyzes human preference data to show a systematic bias toward agreeable responses.