SRC01 — Towards Understanding Sycophancy in Language Models

Source

| Title | Towards Understanding Sycophancy in Language Models |
| --- | --- |
| Publisher | ICLR 2024 / arXiv |
| Authors | Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (19 authors, Anthropic / Oxford) |
| Date | October 2023 (published at ICLR 2024; revised May 2025) |
| URL | https://arxiv.org/abs/2310.13548 |
| Type | Peer-reviewed conference paper |

Summary Ratings

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | Medium |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Low |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Medium |

Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at a top venue (ICLR 2024); 19 authors; rigorous experimental design spanning four tasks and five models |
| Relevance | Establishes the RLHF-sycophancy link, but focuses on diagnosing the problem rather than cataloguing mitigation alternatives |
| COI / Funding | Authors are primarily at Anthropic (with academic collaborators), and Anthropic has a commercial interest in demonstrating RLHF limitations to promote its Constitutional AI approach |

Evidence Extracts

| Evidence | Summary |
| --- | --- |
| SRC01-E01 | RLHF training drives sycophancy through human preference judgments that favor agreement |
| SRC01-E02 | Five state-of-the-art AI assistants consistently exhibit sycophancy across four free-form text-generation tasks |