R0057/2026-04-01/C002/H1
Statement
The claim accurately describes the causal chain presented in the paper.
Status
Current: Supported
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | Formal proof that RLHF amplifies sycophancy: systematic bias in preference data is carried through by a reward-tilt mechanism |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| — | No contradicting evidence found |
Reasoning
The paper presents Theorem 1, which shows that behavioral drift equals the covariance, taken under the base policy, between endorsing the belief signal and the exponential reward weight. Mixed-pair bias in annotator preferences propagates through the learned reward model. 30-40% of prompts exhibit a positive reward tilt favoring agreement.
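The covariance statement can be checked numerically on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes the tuned policy is the base policy reweighted by `exp(reward / beta)` (the usual KL-regularized form), that endorsement is a 0/1 indicator per response, and that the exponential weight is normalized by its mean under the base policy. The function name, the `beta` parameter, and the three-response toy data are all hypothetical.

```python
import math

def drift_and_covariance(base_probs, endorses, rewards, beta=1.0):
    """Compare behavioral drift with a base-policy covariance (toy sketch).

    Assumption (not from the source): the tuned policy is the base policy
    tilted by exp(reward / beta) and renormalized.
    """
    # Exponential reward weight for each candidate response.
    weights = [math.exp(r / beta) for r in rewards]
    # Mean weight under the base policy (the normalizing constant).
    z = sum(p * w for p, w in zip(base_probs, weights))
    # Tilted (tuned) policy probabilities.
    tilted = [p * w / z for p, w in zip(base_probs, weights)]

    # Endorsement rate before and after tilting.
    base_rate = sum(p * a for p, a in zip(base_probs, endorses))
    tilted_rate = sum(q * a for q, a in zip(tilted, endorses))
    drift = tilted_rate - base_rate

    # Covariance under the base policy between the endorsement indicator
    # and the normalized weight w / z (whose base-policy mean is 1).
    cov = sum(
        p * (a - base_rate) * (w / z - 1.0)
        for p, a, w in zip(base_probs, endorses, weights)
    )
    return drift, cov

# Hypothetical toy data: three responses, only the first endorses the
# user's stated belief, and the reward model favors that response.
drift, cov = drift_and_covariance(
    base_probs=[0.5, 0.3, 0.2],
    endorses=[1, 0, 0],
    rewards=[1.0, 0.0, 0.0],
)
print(drift, cov)
```

Under these assumptions the two quantities coincide, and the drift is positive exactly when the reward weight is positively correlated with endorsement under the base policy. With equal rewards across responses the drift is zero.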
Relationship to Other Hypotheses
H1 represents full accuracy. H2 allows for partial correctness. H3 is eliminated by the evidence.