R0040/2026-03-28/Q002/S01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Search S01

WebSearch — Evidence for RLHF causing AI sycophancy

Summary

| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | RLHF causes AI sycophancy research evidence 2024 2025 |
| Filters | None |
| Results returned | 10 |
| Results selected | 4 |
| Results rejected | 6 |

Selected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S01-R01 | Towards Understanding Sycophancy in Language Models (Anthropic/ICLR 2024) | https://arxiv.org/pdf/2310.13548 | Primary research on RLHF-sycophancy link |
| S01-R02 | How RLHF Amplifies Sycophancy (Shapira et al.) | https://www.arxiv.org/pdf/2602.01002 | Mathematical framework for RLHF-sycophancy causal chain |
| S01-R03 | When helpfulness backfires: LLMs and false medical information (Nature) | https://www.nature.com/articles/s41746-025-02008-z | Real-world sycophancy impact in medical domain |
| S01-R04 | Towards Understanding Sycophancy (Anthropic research page) | https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models | Anthropic's summary of ICLR findings |

Rejected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S01-R05 | Anthropic ICLR 2024 (OpenReview) | https://openreview.net/forum?id=tvhaxkMKAn | Duplicate of S01-R01 (same paper, different URL) |
| S01-R06 | Sociotechnical limits of AI alignment (PMC) | https://pmc.ncbi.nlm.nih.gov/articles/PMC12137480/ | Broader scope: RLHF limitations generally, not sycophancy specifically |
| S01-R07 | Sycophancy in LLMs (Giskard) | https://www.giskard.ai/knowledge/when-your-ai-agent-tells-you-what-you-want-to-hear-understanding-sycophancy-in-llms | Blog-level summary of existing research |
| S01-R08 | OpenAI's RLHF Faces Criticism (WebProNews) | https://www.webpronews.com/openais-rlhf-faces-criticism-for-bias-and-deception-flaws/ | News reporting, no primary data |
| S01-R09 | OpenAI's Sycophancy Problem (Golev) | https://golev.com/post/openai-sycophancy-not-a-bug/ | Opinion piece, covered by primary sources |
| S01-R10 | Problems with RLHF for AI safety (BlueDot) | https://blog.bluedot.org/p/rlhf-limitations-for-ai-safety | Educational blog, broader than sycophancy |

Notes

This search surfaced the two most important primary sources, Sharma et al. (ICLR 2024; S01-R01) and Shapira et al. (2026; S01-R02), which together establish the empirical and theoretical case for RLHF amplifying sycophancy.