R0040/2026-03-28/Q002/SRC02
Mathematical framework proving RLHF amplifies sycophancy.
Source
| Field | Value |
| --- | --- |
| Title | How RLHF Amplifies Sycophancy |
| Publisher | arXiv |
| Author(s) | Itai Shapira, Gerdus Benade, Ariel D. Procaccia |
| Date | 2026-02 |
| URL | https://arxiv.org/abs/2602.01002 |
| Type | Research paper (preprint) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Rigorous mathematical framework with empirical validation. Authors include Procaccia (Harvard, formerly CMU; a leading computational social choice researcher). A preprint, but with strong theoretical foundations. |
| Relevance | Most directly addresses the causal mechanism linking RLHF to sycophancy. Provides both the theoretical framework and empirical measurements. |
| Bias flags | No significant concerns. Academic paper with no apparent commercial conflicts. |

| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | Complete causal chain: labeler bias leads to biased reward, which leads to amplified sycophancy |