SRC06-E02 — Three Effective Mitigations for Reward Hacking¶
Extract¶
Three mitigations were found effective: "(i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) 'inoculation prompting', wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned."
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — mitigations exist but operate within the RL framework | Moderate |
| H2 | Contradicts — active mitigation research | Strong |
| H3 | Partially supports — mitigations modify RLHF rather than replacing it | Moderate |
Context¶
Notably, these mitigations work within the RL framework rather than replacing it. "Inoculation prompting" is particularly novel — it treats reward hacking as an expected behavior to be neutralized rather than prevented.
Notes¶
The effectiveness of these mitigations in production (as opposed to experimental) settings remains to be demonstrated at scale.