R0040/2026-04-01/Q002/SRC03/E01¶
PAR reward shaping mitigates reward hacking within RLHF
URL: https://arxiv.org/abs/2502.18770
Extract¶
The paper identifies two design principles for reward shaping in RLHF:
1. The RL reward should be bounded.
2. The RL reward benefits from rapid initial growth followed by gradual convergence.
PAR (Preference As Reward) leverages latent preferences embedded within the reward model as the signal for RL. Results:
- AlpacaEval 2.0: at least 5 percentage points higher win rate than competing approaches
- Maintains robustness against reward hacking even after two full epochs of training
- Requires only a single reference reward for optimal performance
The broader finding: sycophancy as a form of reward hacking can be mitigated through bounded reward transformations (clipping, normalization, log-sigmoid) without abandoning RLHF.
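As an illustration only: a sigmoid transform centered on a single reference reward satisfies both design principles above (it is bounded, and it grows fastest near the reference before converging). The function name and exact centering here are assumptions for the sketch, not the paper's implementation.

```python
import math

def bounded_reward(raw_reward: float, reference_reward: float) -> float:
    """Sigmoid-bounded reward centered on a single reference reward.

    Illustrative sketch (not the paper's code): outputs lie in (0, 1),
    rise steeply near the reference point, and flatten as the raw
    reward moves far above it, limiting the payoff from reward hacking.
    """
    return 1.0 / (1.0 + math.exp(-(raw_reward - reference_reward)))
```

For example, a raw reward equal to the reference maps to 0.5, while rewards far above the reference approach (but never exceed) 1, capping the incentive to over-optimize.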
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Partially contradicts | Shows RLHF can be fixed without being abandoned |
| H2 | Supports | Demonstrates multi-pronged mitigation within RLHF framework |
| H3 | Contradicts | Active research on mitigation contradicts "not fundamental" |
Context¶
This paper shows that reward shaping is a viable path to mitigating sycophancy without replacing RLHF. It supports the view that the problem can be addressed within the existing framework.