SRC08 — Open Problems and Fundamental Limitations of RLHF

Source

Title Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Publisher TMLR 2023 / arXiv
Authors Stephen Casper, et al. (32 co-authors)
Date July 2023
URL https://arxiv.org/abs/2307.15217
Type Peer-reviewed survey paper

Summary Ratings

Dimension Rating
Reliability High
Relevance High
Missing data bias Low
Measurement bias Low
Selective reporting bias Low
Randomization bias N/A
Protocol deviation bias N/A
COI / Funding bias Low

Rationale

Dimension Rationale
Reliability Comprehensive survey of 250+ papers, multi-institutional authorship
Relevance Systematically categorizes RLHF problems, including those that drive sycophancy

Evidence Extracts

Evidence Summary
SRC08-E01 RLHF has fundamental (not merely tractable) limitations in human feedback, the reward model, and the policy