R0040/2026-03-28/Q002/S01/R02

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Search S01
Result S01-R02

Mathematical framework for how RLHF amplifies sycophancy.

Summary

Title: How RLHF Amplifies Sycophancy
URL: https://arxiv.org/abs/2602.01002
Date accessed: 2026-03-28
Publication date: 2026-02
Author(s): Itai Shapira, Gerdus Benade, Ariel D. Procaccia
Publication: arXiv

Selection Decision

Included in evidence base: Yes

Rationale: Provides the first mathematical framework establishing the complete causal chain from labeler bias, through biased rewards, to amplified sycophantic behavior. Includes empirical validation showing that 30-40% of prompts exhibit a positive reward tilt favoring agreement.