Skip to content
Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q002 — RLHF and Sycophancy
Source SRC06
Evidence SRC06-E02

SRC06-E02 — Three Effective Mitigations for Reward Hacking

Extract

Three mitigations were found effective: "(i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) 'inoculation prompting', wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned."

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Supports — mitigations exist but operate within the RL framework Moderate
H2 Contradicts — active mitigation research Strong
H3 Partially supports — mitigations modify RLHF rather than replacing it Moderate

Context

Notably, these mitigations work within the RL framework rather than replacing it. "Inoculation prompting" is particularly novel — it treats reward hacking as an expected behavior to be neutralized rather than prevented.

Notes

The effectiveness of these mitigations in production (as opposed to experimental) settings remains to be demonstrated at scale.