Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q001 — RLHF Alternatives
Source SRC05
Evidence SRC05-E01

SRC05-E01 — Three Categories of RLHF Problems

Extract

The survey identifies problems in three categories: "(1) challenges with feedback, (2) challenges with the reward model, and (3) challenges with the policy." It "highlights the importance of a multi-faceted approach to the development of safer AI systems" and emphasizes that some limitations are fundamental rather than tractable. Specific issues include mode collapse, reward hacking, and the difficulty of developing "a single reward function for diverse users."
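Of the specific issues named above, reward hacking is the most readily illustrated: when a proxy reward stands in for the true objective, optimizing the proxy can actively degrade the true objective. The toy sketch below is illustrative only and not drawn from the survey; the scoring functions and candidate responses are hypothetical assumptions.

```python
# Toy illustration of reward hacking: a proxy reward (response length)
# stands in for the true objective (answer quality). Optimizing the
# proxy selects a worse response. All names here are hypothetical.

def proxy_reward(response: str) -> float:
    # Proxy metric: longer responses score higher
    # (a crude stand-in for "detail" or "effort").
    return float(len(response))

def true_quality(response: str) -> float:
    # True objective (hypothetical): the concise correct answer is best.
    return 10.0 if response == "42" else 1.0

candidates = ["42", "42 " * 50]  # concise answer vs. padded repetition

# Selecting by the proxy picks the padded response...
best = max(candidates, key=proxy_reward)

# ...even though its true quality is lower: the optimizer has
# exploited the gap between the proxy and the intended objective.
assert proxy_reward(best) > proxy_reward("42")
assert true_quality(best) < true_quality("42")
```

The same gap generalizes: any learned reward model is itself a proxy for human intent, which is why the survey treats reward hacking as a structural risk rather than an implementation bug.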

Relevance to Hypotheses

| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — systematic catalogue of problems motivating alternatives | Strong |
| H2 | Contradicts — fundamental limitations imply alternatives are necessary, not optional | Strong |
| H3 | Strongly supports — the distinction between tractable and fundamental problems explains why multiple alternatives coexist | Strong |

Context

This is the most comprehensive academic survey of RLHF limitations and is widely cited in alignment research. Its distinction between tractable limitations (addressable within the RLHF framework) and fundamental ones (inherent to it) is key to understanding why the alternative landscape exists at all.

Notes

The paper recommends "auditing and disclosure standards" as a complementary approach, suggesting technical alternatives alone are insufficient.