# SRC05-E01 — Three Categories of RLHF Problems
## Extract
The survey identifies problems in three categories: "(1) challenges with feedback, (2) challenges with the reward model, and (3) challenges with the policy." It "highlights the importance of a multi-faceted approach to the development of safer AI systems" and emphasizes that some limitations are fundamental rather than tractable. Specific issues include mode collapse, reward hacking, and the difficulty of developing "a single reward function for diverse users."
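Reward hacking is easy to reproduce in miniature. Below is a minimal Python sketch, not anything from the survey: a hypothetical `proxy_reward` that has latched onto response length as a spurious correlate of quality, and a greedy policy that exploits it. The candidate texts and `true_quality` scores are invented for illustration.

```python
# Toy reward-hacking demo. Everything here is a hypothetical stand-in;
# the survey describes the failure mode, not this code.

padded = ("A long answer that repeats itself and pads its phrasing "
          "without adding new information. ") * 3

candidates = [
    {"text": "Short, correct answer.", "true_quality": 0.9},
    {"text": padded, "true_quality": 0.3},
]

def proxy_reward(response: str) -> float:
    """A stand-in reward model that correlates quality with length.

    Length is the spurious feature here; any imperfect proxy works.
    """
    return len(response) / 100.0

# A policy that greedily optimizes the proxy picks the padded answer:
# high proxy reward, low true quality. Always returning this one
# degenerate output is also a small-scale picture of mode collapse.
best = max(candidates, key=lambda c: proxy_reward(c["text"]))
print(f"proxy reward: {proxy_reward(best['text']):.2f}  "
      f"true quality: {best['true_quality']}")
```

Running it shows a proxy reward several times higher for the padded answer despite its lower true quality; the gap between the two numbers is the hacking.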
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — systematic catalogue of problems motivating alternatives | Strong |
| H2 | Contradicts — fundamental limitations imply alternatives are necessary, not optional | Strong |
| H3 | Strongly supports — the distinction between tractable and fundamental problems explains why multiple alternatives coexist | Strong |
## Context
This is among the most comprehensive academic surveys of RLHF limitations and is widely cited in alignment research. Its distinction between tractable challenges and fundamental limitations is key to understanding the landscape of RLHF alternatives; the sketch below illustrates one such fundamental limit.
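One way to see why the diverse-users problem is fundamental rather than tractable: preference aggregation can be cyclic, so no single reward ranking exists at all. A minimal sketch, assuming three hypothetical users whose rankings over responses A, B, and C form a Condorcet cycle (the users and responses are invented; the survey makes the point abstractly):

```python
from itertools import permutations

# Three hypothetical users whose rankings form a Condorcet cycle:
# a majority prefers A over B, B over C, and C over A.
users = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of users rank x above y."""
    wins = sum(ranking.index(x) < ranking.index(y) for ranking in users)
    return wins > len(users) / 2

# A single scalar reward function induces one strict ordering of the
# responses. Check every possible ordering against the majority:
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print(consistent)  # [] -- no ordering satisfies all majority preferences
```

The empty result is the point: no amount of additional preference data resolves the cycle, because it is a structural limit on collapsing diverse preferences into one reward function.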
## Notes
The paper recommends "auditing and disclosure standards" as a complementary approach, suggesting technical alternatives alone are insufficient.