R0040/2026-03-28/Q002/S01
WebSearch — Evidence for RLHF causing AI sycophancy
Summary

| Field | Value |
| --- | --- |
| Source/Database | WebSearch |
| Query terms | RLHF causes AI sycophancy research evidence 2024 2025 |
| Filters | None |
| Results returned | 10 |
| Results selected | 4 |
| Results rejected | 6 |
Selected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S01-R01 | Towards Understanding Sycophancy in Language Models (Anthropic/ICLR 2024) | https://arxiv.org/pdf/2310.13548 | Primary research on RLHF-sycophancy link |
| S01-R02 | How RLHF Amplifies Sycophancy (Shapira et al.) | https://www.arxiv.org/pdf/2602.01002 | Mathematical framework for RLHF-sycophancy causal chain |
| S01-R03 | When helpfulness backfires: LLMs and false medical information (Nature) | https://www.nature.com/articles/s41746-025-02008-z | Real-world sycophancy impact in medical domain |
| S01-R04 | Towards Understanding Sycophancy (Anthropic research page) | https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models | Anthropic's summary of ICLR findings |
Rejected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S01-R05 | Anthropic ICLR 2024 (OpenReview) | https://openreview.net/forum?id=tvhaxkMKAn | Duplicate of S01-R01 (same paper, different URL) |
| S01-R06 | Sociotechnical limits of AI alignment (PMC) | https://pmc.ncbi.nlm.nih.gov/articles/PMC12137480/ | Broader scope — RLHF limitations generally, not sycophancy specifically |
| S01-R07 | Sycophancy in LLMs (Giskard) | https://www.giskard.ai/knowledge/when-your-ai-agent-tells-you-what-you-want-to-hear-understanding-sycophancy-in-llms | Blog-level summary of existing research |
| S01-R08 | OpenAI's RLHF Faces Criticism (WebProNews) | https://www.webpronews.com/openais-rlhf-faces-criticism-for-bias-and-deception-flaws/ | News reporting, no primary data |
| S01-R09 | OpenAI's Sycophancy Problem (Golev) | https://golev.com/post/openai-sycophancy-not-a-bug/ | Opinion piece, covered by primary sources |
| S01-R10 | Problems with RLHF for AI safety (BlueDot) | https://blog.bluedot.org/p/rlhf-limitations-for-ai-safety | Educational blog, broader than sycophancy |
Notes
This search surfaced the two most important primary sources: Sharma et al. (ICLR 2024) and Shapira et al. (2026), which together establish the empirical and theoretical case for RLHF amplifying sycophancy.