

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Source SRC03
Evidence SRC03-E01
Type Factual

PAR reward shaping mitigates reward hacking within RLHF

URL: https://arxiv.org/abs/2502.18770

Extract

The paper identifies two design principles for reward shaping in RLHF:

1. The RL reward should be bounded.
2. The RL reward benefits from rapid initial growth followed by gradual convergence.

PAR (Preference As Reward) leverages the latent preferences embedded within the reward model as the signal for RL. Results:

- AlpacaEval 2.0: a win rate at least 5 percentage points higher than competing approaches.
- Maintains robustness against reward hacking even after two full epochs of training.
- Requires only a single reference reward for optimal performance.

The broader finding: sycophancy as a form of reward hacking can be mitigated through bounded reward transformations (clipping, normalization, log-sigmoid) without abandoning RLHF.

Relevance to Hypotheses

Hypothesis   Relationship            Notes
H1           Partially contradicts   Shows RLHF can be fixed without being abandoned
H2           Supports                Demonstrates multi-pronged mitigation within the RLHF framework
H3           Contradicts             Active research on mitigation contradicts the "not fundamental" claim

Context

This paper shows that reward shaping is a viable path to mitigating sycophancy without replacing RLHF. It supports the view that the problem can be addressed within the existing framework.