R0040/2026-04-01/Q002/SRC03/E01¶
PAR reward shaping mitigates reward hacking within RLHF
URL: https://arxiv.org/abs/2502.18770
Extract¶
The paper identifies two design principles for reward shaping in RLHF:
1. The RL reward should be bounded.
2. The RL reward benefits from rapid initial growth followed by gradual convergence.
PAR (Preference As Reward) leverages latent preferences embedded within the reward model as the signal for RL. Results:
- AlpacaEval 2.0: at least 5 percentage points higher win rate than competing approaches
- Maintains robustness against reward hacking even after two full epochs of training
- Requires only a single reference reward for optimal performance
The broader finding: sycophancy as a form of reward hacking can be mitigated through bounded reward transformations (clipping, normalization, log-sigmoid) without abandoning RLHF.
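As an illustration only: a sigmoid transform centered on a single reference reward satisfies both design principles above (it is bounded, and it grows fastest near the reference before converging). The function name and exact centering here are assumptions for the sketch, not the paper's implementation.

```python
import math

def bounded_reward(raw_reward: float, reference_reward: float) -> float:
    """Sigmoid-bounded reward centered on a single reference reward.

    Illustrative sketch (not the paper's code): outputs lie in (0, 1),
    rise steeply near the reference point, and flatten as the raw
    reward moves far above it, limiting the payoff from reward hacking.
    """
    return 1.0 / (1.0 + math.exp(-(raw_reward - reference_reward)))
```

For example, a raw reward equal to the reference maps to 0.5, while rewards far above the reference approach (but never exceed) 1, capping the incentive to over-optimize.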
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Partially contradicts | Shows RLHF can be fixed without being abandoned |
| H2 | Supports | Demonstrates multi-pronged mitigation within RLHF framework |
| H3 | Contradicts | Active research on mitigation contradicts "not fundamental" |
Context¶
This paper shows that reward shaping is a viable path to mitigating sycophancy without replacing RLHF. It supports the view that the problem can be addressed within the existing framework.