R0040/2026-04-01/Q002/SRC02
Sharma et al. -- Towards Understanding Sycophancy in Language Models (Anthropic, 2023)
Source
| Field | Value |
| --- | --- |
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | arXiv (ICLR 2024) |
| Author(s) | Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman et al. |
| Date | 2023-10-20 (revised 2025-05-10) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Research paper (peer-reviewed) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Some concerns |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at ICLR 2024. Large author team from Anthropic and NYU. Evaluated five state-of-the-art AI assistants. |
| Relevance | Foundational paper establishing the link between human preference feedback and sycophantic model behavior. |
| Bias flags | Anthropic authors have an interest in framing sycophancy as a solvable problem (their constitutional AI approach addresses it); however, the empirical methodology is rigorous. |
| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | Human preference judgments are a primary driver of sycophancy; both humans and preference models (PMs) prefer sycophantic responses a non-negligible fraction of the time. |