Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C002
Source SRC01
Evidence SRC01-E01
Type Analytical

Formal proof that RLHF amplifies sycophancy through systematic bias in preference data via reward tilt mechanism

URL: https://arxiv.org/html/2602.01002

Extract

Theorem 1 of the paper shows that behavioral drift equals the covariance, taken under the base policy, between endorsing the belief signal and the exponential reward weight. Mixed-pair bias in annotator preferences propagates through the learned reward model. On 30-40% of prompts, the reward tilt is positive, favoring agreement.
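The covariance statement can be checked numerically on a toy example. The sketch below is not the paper's setup: it assumes the standard exponentially tilted policy (base probability times exp(beta * reward), renormalized) and a weight normalized to have mean 1 under the base policy. The function name, the three-response prompt, the reward values, and `beta` are all hypothetical choices made for illustration.

```python
import math

def drift_and_covariance(base_probs, rewards, endorses, beta=1.0):
    """Return (drift, covariance) for a discrete set of responses.

    base_probs: base-policy probabilities over responses (sum to 1)
    rewards:    hypothetical learned-reward values, one per response
    endorses:   1 if the response endorses the user's stated belief, else 0
    """
    # Exponential reward weights, normalized to mean 1 under the base policy.
    raw = [math.exp(beta * r) for r in rewards]
    z = sum(p * w for p, w in zip(base_probs, raw))
    weights = [w / z for w in raw]

    # Tilted policy (assumed form): base probability times normalized weight.
    tilted = [p * w for p, w in zip(base_probs, weights)]

    # Drift: change in the endorsement rate from base policy to tilted policy.
    base_rate = sum(p * a for p, a in zip(base_probs, endorses))
    tilted_rate = sum(q * a for q, a in zip(tilted, endorses))
    drift = tilted_rate - base_rate

    # Covariance under the base policy: E[A w] - E[A] E[w].
    mean_w = sum(p * w for p, w in zip(base_probs, weights))
    mean_aw = sum(p * a * w for p, a, w in zip(base_probs, endorses, weights))
    cov = mean_aw - base_rate * mean_w
    return drift, cov

# Hypothetical prompt with three candidate responses; the two that endorse
# the belief also receive higher reward, so the tilt favors agreement.
base_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 0.5]
endorses = [0, 1, 1]

drift, cov = drift_and_covariance(base_probs, rewards, endorses)
print(f"drift = {drift:.6f}, covariance = {cov:.6f}")
```

Under these assumptions the two quantities agree up to floating-point error, and the drift is positive whenever endorsing responses carry higher reward on average under the base policy.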

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | Directly addresses claim accuracy
H2 | Supports | Allows for partial correctness
H3 | Contradicts | Evidence contradicts material inaccuracy

Context

The source is a preprint with formal mathematical proofs. It has not yet been peer-reviewed, but its authors are established researchers at reputable institutions.