SRC07 — Reward Hacking in Reinforcement Learning¶
Source¶
| Title | Reward Hacking in Reinforcement Learning |
| Publisher | Lil'Log (personal blog) |
| Authors | Lilian Weng |
| Date | November 28, 2024 |
| URL | https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ |
| Type | Technical blog post / survey |
Summary Ratings¶
| Dimension | Rating |
|---|---|
| Reliability | Medium-High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Low |
| Randomization bias | N/A |
| Protocol deviation bias | N/A |
| COI / Funding bias | Low |
Rationale¶
| Dimension | Rationale |
|---|---|
| Reliability | Lilian Weng is VP of Research at OpenAI; her blog posts are widely cited as comprehensive technical surveys. Not peer-reviewed but highly rigorous |
| Relevance | Directly addresses reward hacking as the parent category of sycophancy in RLHF |
Evidence Extracts¶
| Evidence | Summary |
|---|---|
| SRC07-E01 | Sycophancy is a manifestation of reward hacking; models exploit proxy-oracle reward gap |