SRC01 — Towards Understanding Sycophancy in Language Models

Source

Title Towards Understanding Sycophancy in Language Models
Publisher ICLR 2024 / arXiv
Authors Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford)
Date October 2023 (published ICLR 2024; revised May 2025)
URL https://arxiv.org/abs/2310.13548
Type Peer-reviewed conference paper

Summary Ratings

Dimension Rating
Reliability High
Relevance High
Missing data bias Low
Measurement bias Low
Selective reporting bias Low
Randomization bias N/A
Protocol deviation bias Low
COI / Funding bias Medium

Rationale

Dimension Rationale
Reliability Peer-reviewed at ICLR 2024, a top-tier ML venue; rigorous experimental design
Relevance Directly addresses the RLHF-sycophancy causal link — the core of Q002
COI / Funding Anthropic authors have a commercial interest in identifying RLHF limitations, which supports the case for their alternative, Constitutional AI (CAI)

Evidence Extracts

Evidence Summary
SRC01-E01 Human preference judgments used in RLHF reward agreement, likely contributing to sycophancy
SRC01-E02 Sycophancy is consistent across state-of-the-art assistants, not specific to any one model
SRC01-E03 Both human raters and preference models prefer convincingly written sycophantic responses over correct ones some fraction of the time