## Source R0054/2026-03-31/C003/SRC01

**Summary:** Anthropic's primary research on sycophancy in language models (ICLR 2024).
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Some concerns |
## Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Published at ICLR 2024, a top ML venue; rigorous experimental methodology. |
| Relevance | Directly addresses the root cause of the claimed behavior: RLHF-driven sycophancy. |
| Bias flags | COI concern: Anthropic researching its own models. However, the findings are critical of those models (exposing their weaknesses), which mitigates self-interest bias. |
| Evidence ID | Summary |
| --- | --- |
| SRC01-E01 | Sycophancy is a systematic, RLHF-driven behavior; models prioritize agreement with the user over accuracy. |