Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C002
Hypothesis H1

Statement

The claim accurately describes the causal chain presented in the paper.

Status

Current: Supported

Supporting Evidence

Evidence Summary
SRC01-E01 Formal proof that RLHF amplifies sycophancy: systematic bias in preference data acts through a reward-tilt mechanism

Contradicting Evidence

Evidence Summary
No contradicting evidence found

Reasoning

The paper's Theorem 1 shows that behavioral drift equals the covariance, under the base policy, between endorsing the belief signal and the exponential reward weight. Mixed-pair bias in annotator preferences propagates through the learned reward models, and 30-40% of prompts exhibit a positive reward tilt favoring agreement.
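
The covariance statement can be checked numerically on a toy example. The sketch below is illustrative only and does not reproduce the paper's notation or proof. It assumes the standard KL-regularized form in which the tuned policy reweights the base policy by exp(reward / beta), normalized so the weight has mean 1 under the base policy. The three-response prompt, the reward values, and the names `tilted_policy`, `agrees`, and `beta` are hypothetical.

```python
import math


def tilted_policy(base_probs, rewards, beta):
    """Reweight a base policy by exp(reward / beta).

    Assumption: the tuned policy has the usual KL-regularized form
    pi(y) = pi0(y) * exp(r(y) / beta) / Z. Returns the tuned policy and
    the normalized weights exp(r / beta) / Z, which average to 1 under pi0.
    """
    raw = [math.exp(r / beta) for r in rewards]
    z = sum(p * w for p, w in zip(base_probs, raw))
    weights = [w / z for w in raw]
    tuned = [p * w for p, w in zip(base_probs, weights)]
    return tuned, weights


def covariance(probs, xs, ys):
    """Covariance of xs and ys under the discrete distribution probs."""
    mean_x = sum(p * x for p, x in zip(probs, xs))
    mean_y = sum(p * y for p, y in zip(probs, ys))
    return sum(p * (x - mean_x) * (y - mean_y) for p, x, y in zip(probs, xs, ys))


# Hypothetical prompt with three candidate responses.
base_probs = [0.5, 0.3, 0.2]   # base policy pi0
agrees = [1, 0, 0]             # 1 if the response endorses the user's stated belief
rewards = [1.0, 0.2, 0.0]      # made-up rewards that favor agreement
beta = 0.5                     # hypothetical KL-regularization strength

tuned, weights = tilted_policy(base_probs, rewards, beta)

# Behavioral drift: change in the probability of endorsing the belief.
drift = (sum(q * a for q, a in zip(tuned, agrees))
         - sum(p * a for p, a in zip(base_probs, agrees)))

# Covariance under the base policy between endorsement and the reward weight.
cov = covariance(base_probs, agrees, weights)

print(f"drift = {drift:.6f}, covariance = {cov:.6f}")
```

Under these assumptions the two printed quantities coincide, and the drift is positive whenever the reward weight is positively correlated with agreement under the base policy, which is the sense in which a positive reward tilt favors agreement.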

Relationship to Other Hypotheses

H1 represents full accuracy. H2 allows for partial correctness. H3 is eliminated by the evidence.