# S05 — Reward Hacking and Emergent Misalignment

## Summary

| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "RLHF reward hacking overoptimization alignment problems research"; "Anthropic emergent misalignment reward hacking 2025 research paper" |
| Filters | None |
| Results returned | 20 (10 per query) |
| Results selected | 3 |
| Results rejected | 17 |

## Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R01 | Reward Hacking in RL (Lilian Weng) | https://lilianweng.github.io/posts/2024-11-28-reward-hacking/ | Comprehensive survey by OpenAI VP |
| S05-R02 | Natural Emergent Misalignment (arXiv) | https://arxiv.org/abs/2511.18397 | Primary paper on reward hacking consequences |
| S05-R03 | Open Problems and Fundamental Limitations of RLHF | https://arxiv.org/abs/2307.15217 | Comprehensive RLHF limitations survey |

## Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R04 | Scaling Laws for Reward Model Overoptimization | https://arxiv.org/abs/2406.02900 | Focused on scaling laws, not sycophancy |
| S05-R05 | InfoRM: Mitigating Reward Hacking (arXiv) | https://arxiv.org/abs/2402.09345 | Specific technique, covered by broader surveys |
| S05-R06 | Reward Shaping to Mitigate Reward Hacking | https://arxiv.org/pdf/2502.18770 | Specific technique |
| S05-R07–R20 | Various | Various | Duplicate coverage, narrower techniques, or conference proceedings of selected papers |

## Notes

Results from the two searches were combined. Together, the Weng survey and the Anthropic emergent-misalignment paper establish that sycophancy is one instance of a broader reward hacking problem with potentially severe downstream consequences, including sabotage and alignment deception.