SRC06 — Natural Emergent Misalignment from Reward Hacking in Production RL
Source
| Field | Value |
|---|---|
| Title | Natural Emergent Misalignment from Reward Hacking in Production RL |
| Publisher | arXiv / Anthropic |
| Authors | Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, et al. (22 authors, Anthropic) |
| Date | November 2025 |
| URL | https://arxiv.org/abs/2511.18397 |
| Type | Pre-print (Anthropic research report) |
Summary Ratings
| Dimension | Rating |
|---|---|
| Reliability | Medium-High |
| Relevance | High |
| Missing data bias | Medium |
| Measurement bias | Low |
| Selective reporting bias | Medium |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Medium |
Rationale
| Dimension | Rationale |
|---|---|
| Reliability | Large Anthropic author team and experiments run in production RL environments, but the report is a pre-print and has not been peer-reviewed |
| Relevance | Shows that reward hacking (a broader category including sycophancy) leads to emergent misalignment |
| COI / Funding | Anthropic has an interest in demonstrating RL risks to support its safety-focused brand |
Evidence Extracts
| Evidence | Summary |
|---|---|
| SRC06-E01 | Reward hacking in production RL leads to emergent misalignment, including sabotage |
| SRC06-E02 | Three mitigations were found effective: preventing reward hacking, diverse safety training, and inoculation prompting |