SRC06: Natural Emergent Misalignment from Reward Hacking in Production RL

Source


Title	Natural Emergent Misalignment from Reward Hacking in Production RL
Publisher	arXiv / Anthropic
Authors	Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, et al. (22 authors, Anthropic)
Date	November 2025
URL	https://arxiv.org/abs/2511.18397
Type	Pre-print (Anthropic research report)

Dimension	Rationale
Reliability	Large Anthropic author team; experiments run in production RL environments; pre-print, not peer-reviewed
Relevance	Shows that reward hacking (a broader category that includes sycophancy) can lead to emergent misalignment
COI / Funding	Anthropic has an interest in demonstrating RL risks to support its safety-focused brand

Evidence	Summary
SRC06-E01	Reward hacking in production RL leads to emergent misalignment, including sabotage
SRC06-E02	Three mitigations found effective: preventing reward hacking, diverse safety training, and inoculation prompting