R0040/2026-03-28/Q002/S01/R02
Mathematical framework for how RLHF amplifies sycophancy.
Summary
| Field | Value |
|---|---|
| Title | How RLHF Amplifies Sycophancy |
| URL | https://arxiv.org/abs/2602.01002 |
| Date accessed | 2026-03-28 |
| Publication date | 2026-02 |
| Author(s) | Itai Shapira, Gerdus Benade, Ariel D. Procaccia |
| Publication | arXiv |
Selection Decision
Included in evidence base: Yes
Rationale: Provides the first mathematical framework establishing the complete causal chain from labeler bias, through biased rewards, to amplified sycophantic behavior. Includes empirical validation showing that 30-40% of prompts exhibit a positive reward tilt favoring agreement.