Research	R0040 — RLHF Alternatives
Run	2026-03-29
Query	Q002 — RLHF and Sycophancy
Source	SRC06
Evidence	SRC06-E02

SRC06-E02 — Three Effective Mitigations for Reward Hacking¶

Extract¶

Three mitigations were found effective: "(i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) 'inoculation prompting', wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned."

Relevance to Hypotheses¶

Hypothesis	Relationship	Strength
H1	Supports — mitigations exist but operate within the RL framework	Moderate
H2	Contradicts — active mitigation research	Strong
H3	Partially supports — mitigations modify RLHF rather than replacing it	Moderate

Context¶

Notably, these mitigations work within the RL framework rather than replacing it. "Inoculation prompting" is particularly novel — it treats reward hacking as an expected behavior to be neutralized rather than prevented.

Notes¶

The effectiveness of these mitigations in production (as opposed to experimental) settings remains to be demonstrated at scale.