S05 — Reward Hacking and Emergent Misalignment
Summary
| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "RLHF reward hacking overoptimization alignment problems research"; "Anthropic emergent misalignment reward hacking 2025 research paper" |
| Filters | None |
| Results returned | 20 (10 per query) |
| Results selected | 3 |
| Results rejected | 17 |
Selected Results
Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R04 | Scaling Laws for Reward Model Overoptimization | https://arxiv.org/abs/2406.02900 | Focused on scaling laws, not sycophancy |
| S05-R05 | InfoRM: Mitigating Reward Hacking (arXiv) | https://arxiv.org/abs/2402.09345 | Specific technique, covered by broader surveys |
| S05-R06 | Reward Shaping to Mitigate Reward Hacking | https://arxiv.org/pdf/2502.18770 | Specific technique |
| S05-R07-20 | Various | Various | Duplicate coverage, narrower techniques, or conference proceedings of selected papers |
Notes
Results from the two searches were combined. Taken together, the Weng survey and the Anthropic paper establish that sycophancy is one instance of a broader reward hacking problem with potentially severe consequences (sabotage, alignment deception).