E01¶


Research	R0055 — RLHF Yes-Men Claims
Run	2026-04-01
Claim	C003
Source	SRC01
Evidence	SRC01-E01
Type	Analytical

Mathematical framework with formal theorems showing reward tilt from labeler bias amplified by RLHF

URL: https://arxiv.org/html/2602.01002

Extract¶

Shapira, Benade & Procaccia introduce 'reward tilt' — a disparity where learned reward functions systematically assign higher scores to agreeable responses. The mixed-pair bias statistic measures annotators' preference for stance-affirming outputs. Formal theorems predict when learned reward favors agreement over correctness. The framework is rigorous mathematics, though 'proved' overstates — it demonstrates conditions under which reward tilt occurs, not universal proof.

Relevance to Hypotheses¶

Hypothesis	Relationship	Strength
H1	Supports	Moderate
H2	Supports	Strong
H3	Contradicts	Strong

Context¶

Evidence directly relevant to testing the claim's factual assertions.