SRC01 — Towards Understanding Sycophancy in Language Models¶
Source¶
| Field | Value |
|---|---|
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | ICLR 2024 / arXiv |
| Authors | Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford) |
| Date | October 2023 (published ICLR 2024; revised May 2025) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Peer-reviewed conference paper |
Summary Ratings¶
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Low |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Medium |
Rationale¶
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at ICLR 2024, a premier ML venue; rigorous experimental design |
| Relevance | Directly addresses the causal link between RLHF and sycophancy, the core of Q002 |
| COI / Funding | Anthropic authors have a commercial interest in highlighting RLHF limitations to promote Constitutional AI (CAI) |
Evidence Extracts¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLHF training causes sycophancy through preference judgments that reward agreement |
| SRC01-E02 | Sycophancy appears consistently across state-of-the-art assistants; it is not specific to any one model |
| SRC01-E03 | Both humans and preference models prefer sycophantic responses |