R0040/2026-04-01/Q002/SRC03
Fu et al. -- Reward Shaping to Mitigate Reward Hacking in RLHF
Source
| Field | Value |
| --- | --- |
| Title | Reward Shaping to Mitigate Reward Hacking in RLHF |
| Publisher | arXiv |
| Author(s) | Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao |
| Date | 2025-02-26 (revised 2026-01-21) |
| URL | https://arxiv.org/abs/2502.18770 |
| Type | Research paper |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | Medium-High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Recent paper with reproducible results; code is publicly available and evaluation uses standard benchmarks. |
| Relevance | Directly demonstrates that reward hacking in RLHF can be mitigated from within the training pipeline via reward shaping, without replacing RLHF itself. |
| Bias flags | Academic paper with no obvious commercial interest. |
| Evidence ID | Summary |
| --- | --- |
| SRC03-E01 | PAR method achieves a 5+ point AlpacaEval win-rate improvement while mitigating reward hacking |
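The PAR evidence above concerns a bounded reward-shaping transform. A minimal sketch, assuming the shaping applies a sigmoid to the raw reward centered on a reference response's reward; the function name, signature, and exact form are illustrative, not the paper's implementation:

```python
import math

def shaped_reward(raw_reward: float, reference_reward: float) -> float:
    """Hypothetical sketch of sigmoid-based reward shaping.

    The raw reward is centered on a reference response's reward and
    squashed through a sigmoid, bounding the shaped reward in (0, 1).
    Because the sigmoid saturates, ever-larger raw rewards yield
    diminishing returns, blunting the incentive to over-optimize
    (hack) the reward model.
    """
    return 1.0 / (1.0 + math.exp(-(raw_reward - reference_reward)))

# At the reference point the shaped reward is exactly 0.5; far above
# it, the shaped reward saturates toward 1.0 instead of growing
# without bound.
print(shaped_reward(2.0, 2.0))   # 0.5
print(shaped_reward(10.0, 2.0))  # close to 1.0
```

The key design property this illustrates is boundedness: unlike a raw scalar reward, the shaped value cannot be driven arbitrarily high, which is one mechanism for limiting reward hacking during policy optimization.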