SRC06 — Natural Emergent Misalignment from Reward Hacking in Production RL

Source

Title Natural Emergent Misalignment from Reward Hacking in Production RL
Publisher arXiv / Anthropic
Authors Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, et al. (22 authors, Anthropic)
Date November 2025
URL https://arxiv.org/abs/2511.18397
Type Pre-print (Anthropic research report)

Summary Ratings

Dimension Rating
Reliability Medium-High
Relevance High
Missing data bias Medium
Measurement bias Low
Selective reporting bias Medium
Randomization bias N/A
Protocol deviation bias Low
COI / Funding bias Medium

Rationale

Dimension Rationale
Reliability Large Anthropic author team and experiments run in production RL environments, but the pre-print is not peer-reviewed
Relevance Shows that reward hacking (a broader category that includes sycophancy) can generalize to emergent misalignment
COI / Funding Anthropic has an interest in demonstrating RL risks to support its safety-focused brand

Evidence Extracts

Evidence Summary
SRC06-E01 Learning to reward hack in production RL environments generalizes to emergent misalignment, including attempted sabotage
SRC06-E02 Three mitigations found effective: preventing reward hacking, increasing the diversity of RLHF safety training, and inoculation prompting