# SRC02 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model
## Source
| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 / arXiv |
| Authors | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | May 2023 (revised July 2024) |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Peer-reviewed conference paper |
## Summary Ratings
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Missing data bias | Low |
| Measurement bias | Low |
| Selective reporting bias | Medium |
| Randomization bias | N/A |
| Protocol deviation bias | Low |
| COI / Funding bias | Low |
## Rationale
| Dimension | Rationale |
|---|---|
| Reliability | Peer-reviewed at NeurIPS 2023, Stanford authors, highly cited (6000+ citations) |
| Relevance | Directly proposes a major RLHF alternative that has been widely adopted |
| Selective reporting | Results are reported mainly on tasks favorable to DPO; later work (Apple, 2025) found out-of-distribution degradation |
## Evidence Extracts
| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO eliminates the explicit reward model by reparameterizing the reward in terms of the policy itself, reducing RLHF to a classification loss over preference pairs (see the sketch after this table) |
| SRC02-E02 | On sentiment control, summarization, and single-turn dialogue, DPO matches or exceeds PPO-based RLHF while being substantially simpler to implement and train |
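
To ground SRC02-E01: the paper's central result (its Eq. 7) rewrites RLHF's KL-constrained reward maximization as a binary classification loss over preference pairs $(x, y_w, y_l)$, where $y_w$ is the preferred completion:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

A minimal PyTorch sketch of this loss follows. The function name, the convention of passing pre-summed per-token log-probabilities, and the dummy batch at the end are illustrative assumptions for this note, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over completion tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), summed over completion tokens
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Binary cross-entropy form of the DPO objective (Eq. 7 of the paper)."""
    # Implicit rewards: beta times the log-ratio of the policy to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Preference classification: push the chosen reward margin above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a dummy batch of 4 preference pairs:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Because the gradient flows only through the two policy log-probabilities, no reward model is trained and no on-policy sampling is needed; the paper uses $\beta = 0.1$ in most of its experiments.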