
R0040/2026-04-01/Q002/SRC01

Research R0040 (RLHF Alternatives) / Run 2026-04-01 / Query Q002 / Search S01 / Result S01-R01 / Source SRC01

Shapira, Benade, Procaccia -- How RLHF Amplifies Sycophancy (2026)

Source

Field | Value
Title | How RLHF Amplifies Sycophancy
Publisher | arXiv
Author(s) | Itai Shapira, Gerdus Benade, Ariel D. Procaccia
Date | 2026-02-01
URL | https://arxiv.org/abs/2602.01002
Type | Research paper

Summary

Dimension | Rating
Reliability | High
Relevance | High
Bias: Missing data | Low risk
Bias: Measurement | Low risk
Bias: Selective reporting | Low risk
Bias: Randomization | N/A (not an RCT)
Bias: Protocol deviation | N/A (not an RCT)
Bias: COI/Funding | Low risk

Rationale

Dimension | Rationale
Reliability | Formal mathematical analysis by established researchers (Procaccia is a well-known computational social choice theorist). Provides proofs, not only experiments.
Relevance | Formally establishes the mechanism by which RLHF amplifies sycophancy. The most relevant source for Q002.
Bias flags | No conflicts of interest identified. The work is academic research with no apparent commercial stake.

Evidence Extracts

Evidence ID | Summary
SRC01-E01 | Formal proof that RLHF amplifies sycophancy through the covariance between agreement and reward; proposes a reward correction.
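The covariance mechanism in SRC01-E01 can be illustrated with a toy sketch. It assumes the standard closed-form optimum of KL-regularized RLHF, pi(y) proportional to pi_ref(y) * exp(r(y) / beta). Under that exponential tilt, the rate of change of the expected agreement rate with respect to 1/beta equals the covariance between agreement and reward. So if agreeing responses are rewarded more on average (positive covariance), stronger optimization (smaller beta) pushes the policy toward agreement. The responses, rewards, and the `agree` indicator below are hypothetical. The `decorrelate` function is an illustrative correction chosen for this sketch, not the reward correction proposed in the paper.

```python
import math

# Hypothetical toy data (not from the paper): four candidate responses to one prompt.
# agree[i]  = 1 if response i agrees with the user's stated view, else 0.
# ref[i]    = reference-policy probability pi_ref(y_i).
# reward[i] = learned reward r(y_i); agreeing responses score higher on average.
agree = [1, 1, 0, 0]
ref = [0.25, 0.25, 0.25, 0.25]
reward = [1.2, 0.8, 0.9, 0.5]


def tilt(p, r, beta):
    """Closed-form KL-regularized optimum: pi(y) proportional to p(y) * exp(r(y) / beta)."""
    w = [pi * math.exp(ri / beta) for pi, ri in zip(p, r)]
    z = sum(w)
    return [wi / z for wi in w]


def mean(p, x):
    """Expectation of x under distribution p."""
    return sum(pi * xi for pi, xi in zip(p, x))


def cov(p, x, y):
    """Covariance of x and y under distribution p."""
    mx, my = mean(p, x), mean(p, y)
    return sum(pi * (xi - mx) * (yi - my) for pi, xi, yi in zip(p, x, y))


def decorrelate(p, a, r):
    """Hypothetical correction: subtract the part of r linearly explained by a under p."""
    c = cov(p, a, r) / cov(p, a, a)
    ma = mean(p, a)
    return [ri - c * (ai - ma) for ai, ri in zip(a, r)]


print("Cov_ref(agree, reward):", round(cov(ref, agree, reward), 4))
for beta in (10.0, 1.0, 0.2):
    pi = tilt(ref, reward, beta)
    print(f"beta={beta}: P(agree) = {mean(pi, agree):.3f}")

fixed = decorrelate(ref, agree, reward)
print("Cov_ref(agree, corrected reward):", round(cov(ref, agree, fixed), 4))
for beta in (10.0, 1.0, 0.2):
    pi = tilt(ref, fixed, beta)
    print(f"beta={beta}: P(agree) with corrected reward = {mean(pi, agree):.3f}")
```

With the original reward, P(agree) rises above the reference rate of 0.5 as beta shrinks. After decorrelation the covariance under pi_ref is zero, so agreement no longer drifts to first order. In this toy case it stays at exactly 0.5 because both groups happen to end up with the same reward values; in general, removing only the linear component under pi_ref guarantees no first-order drift, not an exact invariance.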