SRC07 — Reward Hacking in Reinforcement Learning

Source

Title Reward Hacking in Reinforcement Learning
Publisher Lil'Log (personal blog)
Authors Lilian Weng
Date November 28, 2024
URL https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
Type Technical blog post / survey

Summary Ratings

Dimension Rating
Reliability Medium-High
Relevance High
Missing data bias Low
Measurement bias Low
Selective reporting bias Low
Randomization bias N/A
Protocol deviation bias N/A
COI / Funding bias Low

Rationale

Dimension Rationale
Reliability Lilian Weng was VP of Research at OpenAI until November 2024, shortly before this post was published; her blog posts are widely cited as comprehensive technical surveys. The post is not peer-reviewed but is highly rigorous
Relevance Directly addresses reward hacking as the parent category of sycophancy in RLHF

Evidence Extracts

Evidence Summary
SRC07-E01 Sycophancy is a manifestation of reward hacking: models exploit the gap between the proxy reward they are trained on and the oracle (true) reward