# S05 — Reward Hacking and Emergent Misalignment

## Summary

| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "RLHF reward hacking overoptimization alignment problems research"; "Anthropic emergent misalignment reward hacking 2025 research paper" |
| Filters | None |
| Results returned | 20 (10 per query) |
| Results selected | 3 |
| Results rejected | 17 |

## Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R01 | Reward Hacking in RL (Lilian Weng) | https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ | Comprehensive survey by OpenAI VP |
| S05-R02 | Natural Emergent Misalignment (arXiv) | https://arxiv.org/abs/2511.18397 | Primary paper on reward hacking consequences |
| S05-R03 | Open Problems and Fundamental Limitations of RLHF | https://arxiv.org/abs/2307.15217 | Comprehensive RLHF limitations survey |

## Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R04 | Scaling Laws for Reward Model Overoptimization | https://arxiv.org/abs/2406.02900 | Focused on scaling laws, not sycophancy |
| S05-R05 | InfoRM: Mitigating Reward Hacking (arXiv) | https://arxiv.org/abs/2402.09345 | Specific technique, covered by broader surveys |
| S05-R06 | Reward Shaping to Mitigate Reward Hacking | https://arxiv.org/pdf/2502.18770 | Specific technique |
| S05-R07–R20 | Various | Various | Duplicate coverage, narrower techniques, or conference proceedings of selected papers |

## Notes

Results from the two searches were combined. Together, the Weng survey and the Anthropic emergent-misalignment paper establish that sycophancy is one instance of a broader reward hacking problem with potentially severe downstream consequences, including sabotage and alignment deception.