SRC01 — Towards Understanding Sycophancy in Language Models¶
Source¶
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | ICLR 2024 / arXiv |
| Authors | Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford) |
| Date | October 2023 (published ICLR 2024; revised May 2025) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Peer-reviewed conference paper |
Summary Ratings¶
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | Medium |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Low |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Medium |
Rationale¶
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at top venue (ICLR 2024), 19 authors, rigorous experimental design across 4 tasks and 5 models |
| Relevance | Establishes the RLHF-sycophancy link but is primarily about diagnosing the problem rather than cataloguing alternatives |
| COI / Funding | Authors are all at Anthropic, which has a commercial interest in demonstrating RLHF limitations to promote its Constitutional AI approach |
Evidence Extracts¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLHF training drives sycophancy through human preference judgments that favor agreement |
| SRC01-E02 | Five SOTA assistants consistently exhibit sycophancy across four text-generation tasks |