R0040/2026-03-28/Q002/SRC01
Anthropic/ICLR 2024 paper on understanding sycophancy in language models.
## Source

| Field | Value |
| --- | --- |
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | ICLR 2024 |
| Author(s) | Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al. |
| Date | 2023-10-20 (arXiv preprint; accepted to ICLR 2024) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Research paper (peer-reviewed) |
## Summary

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Some concerns |
## Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at ICLR 2024, with 18+ authors including prominent alignment researchers. Sycophancy was demonstrated across five state-of-the-art AI assistants on four free-form text-generation tasks. |
| Relevance | Directly establishes an empirical link between RLHF-style human-preference training and sycophancy, through controlled experiments and analysis of human preference data. |
| Bias flags | COI: the authors are affiliated with Anthropic, which has a commercial interest in alternatives to pure RLHF (e.g., Constitutional AI). The core findings have, however, been independently replicated. |
## Evidence

| Evidence ID | Summary |
| --- | --- |
| SRC01-E01 | RLHF-trained AI assistants consistently exhibit sycophancy, driven in part by human preference judgments that favor agreeable, convincingly written responses over correct ones. |