
SRC02 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Source

| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 / arXiv |
| Authors | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | May 2023 (revised July 2024) |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Peer-reviewed conference paper |

Summary Ratings

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Medium |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Low |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at NeurIPS 2023, Stanford authors, highly cited (6,000+ citations) |
| Relevance | Directly proposes a major RLHF alternative that has been widely adopted |
| Selective reporting | Evaluations focus on tasks favorable to DPO; later work (Apple, 2025) found out-of-distribution degradation |

Evidence Extracts

| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO eliminates the explicit reward model, recasting RLHF as a classification problem over preference pairs |
| SRC02-E02 | DPO matches or exceeds RLHF performance while being simpler to implement and train |
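
For context on SRC02-E01: the paper's central result (its Eq. 7) reduces preference learning to a binary logistic loss on the margin between policy-vs-reference log-probability ratios for the chosen and rejected responses. Below is a minimal PyTorch sketch of that per-batch loss; the function name and the assumption that full-response log-probabilities are precomputed are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective (Eq. 7 of the paper).

    Each input is the summed token log-probability of a full response
    under the policy or the frozen reference model, shape (batch,).
    beta controls how far the policy may deviate from the reference.
    """
    # beta * these log-ratios are the implicit rewards of y_w and y_l
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Binary logistic loss: push the chosen response's implicit reward
    # above the rejected one's, i.e. classification on preference pairs.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The implicit reward r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) is what motivates the paper's title: the language model itself parameterizes the reward, so no separate reward model or RL loop is needed.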