R0057/2026-04-01/C002/H1


Research	R0057 — RLHF Yes-Men Claims v3
Run	2026-04-01
Claim	C002
Hypothesis	H1

Statement

The claim accurately describes the causal chain presented in the paper.

Status

Current: Supported

Supporting Evidence

Evidence	Summary
SRC01-E01	Formal proof that RLHF amplifies sycophancy: systematic bias in preference data is carried into the policy via a reward-tilt mechanism

Contradicting Evidence

Evidence	Summary
—	No contradicting evidence found

Reasoning

Theorem 1 of the paper shows that behavioral drift equals the covariance, taken under the base policy, between endorsing the belief signal and the exponential reward weight. Mixed-pair bias in annotator preferences propagates through the learned reward model. Between 30% and 40% of prompts exhibit a positive reward tilt favoring agreement.
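
As a rough illustration of the covariance identity described above, the following Python sketch assumes the standard KL-regularized form of RLHF, in which the tuned policy reweights the base policy by an exponential reward weight exp(r/beta) normalized to mean 1 under the base policy. That form, the function name `behavioral_drift`, the parameter `beta`, and the toy numbers are all assumptions made for illustration and are not taken from the paper. Under those assumptions, the change in the probability of endorsing the belief signal matches the base-policy covariance between endorsement and the weight.

```python
import math


def behavioral_drift(base_probs, rewards, agrees, beta=1.0):
    """Return (drift, covariance) for a discrete set of candidate responses.

    base_probs: base-policy probabilities over responses (should sum to 1)
    rewards:    learned reward assigned to each response
    agrees:     1 if the response endorses the stated belief, else 0
    beta:       KL-regularization strength (hypothetical parameter)
    """
    # Exponential reward weight, normalized to mean 1 under the base policy.
    raw = [math.exp(r / beta) for r in rewards]
    z = sum(p * w for p, w in zip(base_probs, raw))
    weights = [w / z for w in raw]

    # Assumed tilted policy: pi_tuned(y) = pi_base(y) * weight(y).
    tuned_probs = [p * w for p, w in zip(base_probs, weights)]

    # Drift: change in the probability of endorsing the belief signal.
    base_agree = sum(p * a for p, a in zip(base_probs, agrees))
    tuned_agree = sum(q * a for q, a in zip(tuned_probs, agrees))
    drift = tuned_agree - base_agree

    # Covariance under the base policy between endorsement and the weight.
    mean_weight = sum(p * w for p, w in zip(base_probs, weights))
    covariance = sum(
        p * (a - base_agree) * (w - mean_weight)
        for p, a, w in zip(base_probs, agrees, weights)
    )
    return drift, covariance


if __name__ == "__main__":
    # Toy prompt with two responses; the agreeing one gets the higher reward.
    drift, covariance = behavioral_drift([0.5, 0.5], [1.0, 0.0], [1, 0])
    print(f"drift={drift:.6f} covariance={covariance:.6f}")
```

When every response receives the same reward, the weight is constant and both quantities are zero. When the reward favors the agreeing response, the covariance is positive and the tuned policy agrees more often than the base policy.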

Relationship to Other Hypotheses

H1 represents full accuracy. H2 allows for partial correctness. H3 is eliminated by the evidence.