# SRC02 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model
## Source
| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 / arXiv |
| Authors | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | May 2023 (revised July 2024) |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Peer-reviewed conference paper |
## Summary Ratings
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Medium |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Low |
## Rationale
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at NeurIPS 2023, Stanford authors, highly cited (6000+ citations) |
| Relevance | Directly proposes a major RLHF alternative that has been widely adopted |
| Selective reporting | Results are reported mainly on tasks favorable to DPO; later work (Apple, 2025) found out-of-distribution degradation |
## Evidence Extracts
| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO eliminates the explicit reward model by reparameterizing the reward in terms of the policy itself, reducing RLHF to a classification loss over preference pairs (see the sketch after this table) |
| SRC02-E02 | On sentiment control, summarization, and single-turn dialogue, DPO matches or exceeds PPO-based RLHF while being substantially simpler to implement and train |
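
To ground SRC02-E01: the paper's central result (its Eq. 7) rewrites RLHF's KL-constrained reward maximization as a binary classification loss over preference pairs $(x, y_w, y_l)$, where $y_w$ is the preferred completion:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

A minimal PyTorch sketch of this loss follows. The function name, the convention of passing pre-summed per-token log-probabilities, and the dummy batch at the end are illustrative assumptions for this note, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over completion tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), summed over completion tokens
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Binary cross-entropy form of the DPO objective (Eq. 7 of the paper)."""
    # Implicit rewards: beta times the log-ratio of the policy to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Preference classification: push the chosen reward margin above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a dummy batch of 4 preference pairs:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Because the gradient flows only through the two policy log-probabilities, no reward model is trained and no on-policy sampling is needed; the paper uses $\beta = 0.1$ in most of its experiments.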