
Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Search S01
Result S01-R01

Anthropic research on understanding sycophancy in language models, published at ICLR 2024.

Summary

Title: Towards Understanding Sycophancy in Language Models
URL: https://arxiv.org/abs/2310.13548
Date accessed: 2026-03-28
Publication date: 2023-10-20 (revised 2025-05-10)
Author(s): Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al.
Publication: ICLR 2024

Selection Decision

Included in evidence base: Yes

Rationale: Foundational peer-reviewed paper (ICLR 2024) establishing the empirical link between RLHF training and sycophancy. Demonstrates consistent sycophantic behavior across five AI assistants on four free-form text-generation tasks, and analyzes human preference data to show a systematic bias favoring agreeable responses over correct ones.