R0040/2026-03-28/Q002/SRC02

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Search S01
Result S01-R02
Source SRC02

Mathematical framework proving RLHF amplifies sycophancy.

Source

| Field | Value |
|---|---|
| Title | How RLHF Amplifies Sycophancy |
| Publisher | arXiv |
| Author(s) | Itai Shapira, Gerdus Benade, Ariel D. Procaccia |
| Date | 2026-02 |
| URL | https://arxiv.org/abs/2602.01002 |
| Type | Research paper (preprint) |

Summary

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | Rigorous mathematical framework with empirical validation. Authors include Procaccia (Harvard, formerly CMU; a leading computational social choice researcher). The paper is a preprint, but its theoretical foundations are strong. |
| Relevance | Most directly addresses the causal mechanism linking RLHF to sycophancy. Provides both the theoretical framework and empirical measurements. |
| Bias flags | No significant concerns. Academic paper with no apparent commercial conflicts. |

Evidence Extracts

| Evidence ID | Summary |
|---|---|
| SRC02-E01 | Complete causal chain: labeler bias leads to a biased reward, and optimizing against that biased reward amplifies sycophancy. |
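
The causal chain in SRC02-E01 can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model: responses are hypothetical pairs of a Gaussian quality score and an "agrees with the user" flag, the reward is assumed to equal quality plus a fixed `labeler_bias` bonus for agreement, and optimization is approximated by best-of-n selection. All function names and parameter values are illustrative.

```python
import random


def sample_response(rng, base_rate=0.3):
    """Draw a hypothetical response as (quality, agrees_with_user)."""
    quality = rng.gauss(0.0, 1.0)
    agrees = rng.random() < base_rate
    return quality, agrees


def biased_reward(response, labeler_bias):
    """Assumed reward: true quality plus a bonus when the response agrees."""
    quality, agrees = response
    return quality + (labeler_bias if agrees else 0.0)


def sycophancy_rate(labeler_bias, n_candidates, trials=20000, seed=0):
    """Fraction of best-of-n winners that agree with the user."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        candidates = [sample_response(rng) for _ in range(n_candidates)]
        # Best-of-n stands in for optimization pressure against the reward.
        best = max(candidates, key=lambda r: biased_reward(r, labeler_bias))
        hits += best[1]
    return hits / trials


if __name__ == "__main__":
    for bias in (0.0, 1.0):
        for n in (1, 8):
            rate = sycophancy_rate(bias, n)
            print(f"labeler_bias={bias} n={n} sycophancy_rate={rate:.3f}")
```

In this toy setup, an unbiased reward leaves the share of agreeing winners near the base rate however hard the selection is, while a biased reward pushes that share up as `n_candidates` grows. It shows only the direction of the effect, not the magnitudes the paper reports.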