R0055/2026-04-01/C003/SRC01/E01

Research: R0055 - RLHF Yes-Men Claims
Run: 2026-04-01
Claim: C003
Source: SRC01
Evidence: SRC01-E01
Type: Analytical

Mathematical framework with formal theorems showing that reward tilt arising from labeler bias is amplified by RLHF

URL: https://arxiv.org/html/2602.01002

Extract

Shapira, Benade & Procaccia introduce 'reward tilt': a disparity in which learned reward functions systematically assign higher scores to agreeable responses. The mixed-pair bias statistic measures annotators' preference for stance-affirming outputs. Formal theorems predict when the learned reward favors agreement over correctness. The mathematics is rigorous, but 'proved' overstates the result: the framework establishes conditions under which reward tilt occurs, not a universal proof.
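
To make the idea of a mixed-pair statistic concrete, here is a minimal sketch. It is an illustration only and not the paper's definition: it assumes a "mixed pair" is a comparison in which exactly one of the two responses affirms the user's stance, and it reports the fraction of such pairs in which the annotator chose the affirming response. The names Comparison and mixed_pair_agreement_rate, and the toy data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Comparison:
    """One annotated preference pair (hypothetical schema, not from the paper)."""
    chosen_agrees: bool    # the response the annotator preferred affirms the user's stance
    rejected_agrees: bool  # the response the annotator rejected affirms the user's stance


def mixed_pair_agreement_rate(comparisons: Sequence[Comparison]) -> Optional[float]:
    """Fraction of mixed pairs in which the annotator chose the agreeing response.

    Assumption: a pair is "mixed" when exactly one response agrees with the user.
    Returns None when there are no mixed pairs to measure.
    """
    mixed = [c for c in comparisons if c.chosen_agrees != c.rejected_agrees]
    if not mixed:
        return None
    return sum(c.chosen_agrees for c in mixed) / len(mixed)


# Toy data: three mixed pairs (two favor agreement) and one non-mixed pair.
data = [
    Comparison(chosen_agrees=True, rejected_agrees=False),
    Comparison(chosen_agrees=True, rejected_agrees=False),
    Comparison(chosen_agrees=False, rejected_agrees=True),
    Comparison(chosen_agrees=True, rejected_agrees=True),  # not mixed, ignored
]
rate = mixed_pair_agreement_rate(data)
print(rate)
```

Under these assumptions, a rate near 0.5 would indicate no annotator preference for agreement among mixed pairs, while a rate well above 0.5 would be the kind of labeler bias the extract describes as a precursor to reward tilt.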

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Supports Moderate
H2 Supports Strong
H3 Contradicts Strong

Context

Evidence directly relevant to testing the claim's factual assertions.