Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Search S02
Result S02-R01
Source SRC03

Fu et al. -- Reward Shaping to Mitigate Reward Hacking in RLHF

Source

| Field | Value |
| --- | --- |
| Title | Reward Shaping to Mitigate Reward Hacking in RLHF |
| Publisher | arXiv |
| Author(s) | Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao |
| Date | 2025-02-26 (revised 2026-01-21) |
| URL | https://arxiv.org/abs/2502.18770 |
| Type | Research paper |

Summary

| Dimension | Rating |
| --- | --- |
| Reliability | Medium-High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Recent paper with reproducible results. Code is publicly available. Tested on standard benchmarks. |
| Relevance | Directly demonstrates that reward hacking can be mitigated inside the RLHF pipeline through reward shaping, without replacing RLHF. |
| Bias flags | Academic paper with no obvious commercial interest. |

Evidence Extracts

| Evidence ID | Summary |
| --- | --- |
| SRC03-E01 | The PAR method improves AlpacaEval win rate by 5 or more percentage points while mitigating reward hacking. |
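
To make the reward shaping idea behind SRC03-E01 concrete, here is a minimal sketch. It assumes that PAR (Preference As Reward) passes the centered reward, meaning the reward model score minus the score of a reference response, through a sigmoid so that the RL reward is bounded. This is how the paper is understood here, not the authors' code. The function name `shaped_reward` and its parameters are hypothetical.

```python
import math


def shaped_reward(reward: float, reference_reward: float) -> float:
    """Map a raw reward-model score to a bounded RL reward.

    Assumption: the shaping is sigmoid(reward - reference_reward).
    The result always lies in the closed interval [0, 1].
    """
    centered = reward - reference_reward
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if centered >= 0:
        return 1.0 / (1.0 + math.exp(-centered))
    z = math.exp(centered)
    return z / (1.0 + z)


if __name__ == "__main__":
    # Raw scores far above the reference saturate instead of growing without
    # bound, which limits the payoff from exploiting the reward model.
    for raw in (-5.0, 0.0, 1.0, 5.0, 50.0):
        print(raw, shaped_reward(raw, reference_reward=0.0))
```

A response that scores exactly as well as the reference gets a shaped reward of 0.5. Gains shrink as the raw score moves further above the reference, so the policy has less incentive to push reward model scores to extreme values.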