SRC07 — Reward Hacking in Reinforcement Learning

Source


Title	Reward Hacking in Reinforcement Learning
Publisher	Lil'Log (personal blog)
Authors	Lilian Weng
Date	November 28, 2024
URL	https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
Type	Technical blog post / survey

Dimension	Rationale
Reliability	Lilian Weng was VP of Research and Safety at OpenAI until November 2024, shortly before this post was published; her blog posts are widely cited as comprehensive technical surveys. Not peer-reviewed, but thorough and well sourced.
Relevance	Directly addresses reward hacking as the parent category of sycophancy in RLHF

Evidence	Summary
SRC07-E01	Sycophancy is a manifestation of reward hacking: models exploit the gap between the proxy reward they are trained on and the oracle reward it is meant to approximate.