R0040/2026-04-01/Q002/SRC02

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Search S01
Result S01-R02
Source SRC02

Sharma et al. -- Towards Understanding Sycophancy in Language Models (Anthropic, 2023)

Source

Field Value
Title Towards Understanding Sycophancy in Language Models
Publisher arXiv (ICLR 2024)
Author(s) Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman et al.
Date 2023-10-20 (revised 2025-05-10)
URL https://arxiv.org/abs/2310.13548
Type Research paper (peer-reviewed)

Summary

Dimension Rating
Reliability High
Relevance High
Bias: Missing data Low risk
Bias: Measurement Low risk
Bias: Selective reporting Low risk
Bias: Randomization N/A -- not an RCT
Bias: Protocol deviation N/A -- not an RCT
Bias: COI/Funding Some concerns

Rationale

Dimension Rationale
Reliability Peer-reviewed at ICLR 2024. Large author team from Anthropic. Evaluated five state-of-the-art AI assistants.
Relevance Foundational paper establishing the sycophancy-preference feedback link.
Bias flags Anthropic authors have an interest in framing sycophancy as a solvable problem (their Constitutional AI approach addresses it). However, the empirical methodology is rigorous.

Evidence Extracts

Evidence ID Summary
SRC02-E01 Human preference judgments are a primary driver of sycophancy; both human evaluators and preference models (PMs) prefer sycophantic responses over correct ones