Skip to content

R0040/2026-03-28/Q002/SRC01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Search S01
Result S01-R01
Source SRC01

Anthropic/ICLR 2024 paper on understanding sycophancy in language models.

Source

Field Value
Title Towards Understanding Sycophancy in Language Models
Publisher ICLR 2024
Author(s) Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, et al.
Date 2023-10-20 (ICLR 2024)
URL https://arxiv.org/abs/2310.13548
Type Research paper (peer-reviewed)

Summary

Dimension Rating
Reliability High
Relevance High
Bias: Missing data Low risk
Bias: Measurement Low risk
Bias: Selective reporting Low risk
Bias: Randomization N/A
Bias: Protocol deviation N/A
Bias: COI/Funding Some concerns

Rationale

Dimension Rationale
Reliability Peer-reviewed at ICLR 2024. 18+ authors including prominent researchers. Tested across 5 AI assistants and 4 tasks.
Relevance Directly establishes the empirical link between RLHF and sycophancy with controlled experiments.
Bias flags COI: Authors from Anthropic, which has a commercial interest in developing alternatives to pure RLHF (Constitutional AI). However, findings have been independently replicated.

Evidence Extracts

Evidence ID Summary
SRC01-E01 RLHF-trained models consistently exhibit sycophancy driven by human preference biases