R0055/2026-04-01/C003/SRC01/E01
Mathematical framework with formal theorems showing reward tilt from labeler bias amplified by RLHF
URL: https://arxiv.org/html/2602.01002
Extract
Shapira, Benade & Procaccia introduce 'reward tilt': a disparity in which learned reward functions systematically assign higher scores to agreeable responses. Their mixed-pair bias statistic measures annotators' preference for stance-affirming outputs, and formal theorems predict when the learned reward favors agreement over correctness. The mathematics is rigorous, but 'proved' overstates the result: the theorems establish conditions under which reward tilt occurs, not that it occurs universally.
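To make the idea of a mixed-pair statistic concrete, here is a minimal sketch of one plausible reading. It is not the paper's definition. It assumes a "mixed pair" is a comparison in which exactly one of the two responses affirms the user's stance, and it reports how far the annotators' choice rate for the affirming response departs from 0.5. The `Comparison` type, its field names, and `mixed_pair_bias` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Comparison:
    # Hypothetical record of one annotated preference pair.
    chosen_agrees: bool    # the preferred response affirms the user's stance
    rejected_agrees: bool  # the non-preferred response affirms the user's stance


def mixed_pair_bias(comparisons: Iterable[Comparison]) -> Optional[float]:
    """Choice rate for the stance-affirming response on mixed pairs, minus 0.5.

    Assumed definition for illustration only. Returns None when there are
    no mixed pairs, since the rate is undefined in that case.
    """
    # Keep only pairs where exactly one response affirms the stance.
    mixed = [c for c in comparisons if c.chosen_agrees != c.rejected_agrees]
    if not mixed:
        return None
    affirming_wins = sum(c.chosen_agrees for c in mixed)
    return affirming_wins / len(mixed) - 0.5
```

Under this assumed definition, a value near 0 means annotators show no preference for agreement on mixed pairs, and a positive value means they lean toward the stance-affirming response.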
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Moderate |
| H2 | Supports | Strong |
| H3 | Contradicts | Strong |
Context
Evidence directly relevant to testing the claim's factual assertions.