# SRC05-E01 — Three Categories of RLHF Problems
## Extract
The survey identifies problems in three categories: "(1) challenges with feedback, (2) challenges with the reward model, and (3) challenges with the policy." It "highlights the importance of a multi-faceted approach to the development of safer AI systems" and emphasizes that some limitations are fundamental rather than tractable. Specific issues include mode collapse, reward hacking, and the difficulty of developing "a single reward function for diverse users."
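Reward hacking is easy to reproduce in miniature. Below is a minimal Python sketch, not anything from the survey: a hypothetical `proxy_reward` that has latched onto response length as a spurious correlate of quality, and a greedy policy that exploits it. The candidate texts and `true_quality` scores are invented for illustration.

```python
# Toy reward-hacking demo. Everything here is a hypothetical stand-in;
# the survey describes the failure mode, not this code.

padded = ("A long answer that repeats itself and pads its phrasing "
          "without adding new information. ") * 3

candidates = [
    {"text": "Short, correct answer.", "true_quality": 0.9},
    {"text": padded, "true_quality": 0.3},
]

def proxy_reward(response: str) -> float:
    """A stand-in reward model that correlates quality with length.

    Length is the spurious feature here; any imperfect proxy works.
    """
    return len(response) / 100.0

# A policy that greedily optimizes the proxy picks the padded answer:
# high proxy reward, low true quality. Always returning this one
# degenerate output is also a small-scale picture of mode collapse.
best = max(candidates, key=lambda c: proxy_reward(c["text"]))
print(f"proxy reward: {proxy_reward(best['text']):.2f}  "
      f"true quality: {best['true_quality']}")
```

Running it shows a proxy reward several times higher for the padded answer despite its lower true quality; the gap between the two numbers is the hacking.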
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — systematic catalogue of problems motivating alternatives | Strong |
| H2 | Contradicts — fundamental limitations imply alternatives are necessary, not optional | Strong |
| H3 | Strongly supports — the distinction between tractable and fundamental problems explains why multiple alternatives coexist | Strong |
## Context
This is among the most comprehensive academic surveys of RLHF limitations and is widely cited in alignment research. Its distinction between tractable challenges and fundamental limitations is key to understanding the landscape of RLHF alternatives; the sketch below illustrates one such fundamental limit.
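One way to see why the diverse-users problem is fundamental rather than tractable: preference aggregation can be cyclic, so no single reward ranking exists at all. A minimal sketch, assuming three hypothetical users whose rankings over responses A, B, and C form a Condorcet cycle (the users and responses are invented; the survey makes the point abstractly):

```python
from itertools import permutations

# Three hypothetical users whose rankings form a Condorcet cycle:
# a majority prefers A over B, B over C, and C over A.
users = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of users rank x above y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in users)
    return wins > len(users) / 2

# A single scalar reward function induces one strict ordering of the
# responses. Check every possible ordering against the majority:
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print(consistent)  # [] -- no ordering satisfies all majority preferences
```

The empty result is the point: no amount of additional preference data resolves the cycle, because it is a structural limit on collapsing diverse preferences into one reward function.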
## Notes
The paper recommends "auditing and disclosure standards" as a complementary approach, suggesting technical alternatives alone are insufficient.