SRC03

Fu et al. -- Reward Shaping to Mitigate Reward Hacking in RLHF

Source

Field	Value
Title	Reward Shaping to Mitigate Reward Hacking in RLHF
Publisher	arXiv
Author(s)	Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao
Date	2025-02-26 (revised 2026-01-21)
URL	https://arxiv.org/abs/2502.18770
Type	Research paper

Dimension	Rationale
Reliability	Recent paper with reproducible results. Code publicly available. Tested on standard benchmarks.
Relevance	Directly demonstrates that reward hacking can be mitigated within RLHF itself through reward shaping.
Bias flags	Academic paper without obvious commercial interest.

Evidence ID	Summary
SRC03-E01	PAR (Preference As Reward) achieves an AlpacaEval 2.0 win rate at least 5 percentage points higher than competing approaches while mitigating reward hacking