R0040/2026-03-28/Q001/SRC02
Original DPO paper from Stanford, published at NeurIPS 2023.
Source
| Field | Value |
| --- | --- |
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | 2023-05-29 |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Research paper (peer-reviewed) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at a top venue (NeurIPS 2023). Authors from Stanford with strong ML credentials (Manning, Finn, Ermon). |
| Relevance | Directly defines the most widely adopted alternative to RLHF. |
| Bias flags | Selective reporting: the authors compare primarily against PPO-based RLHF and best-of-N sampling. Subsequent work has identified cases where DPO underperforms RLHF on certain tasks. |
Evidence

| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | DPO eliminates the separate reward model and RL training loop, and matches or exceeds RLHF performance on the tasks evaluated. |
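For context on SRC02-E01, the objective the paper optimizes in place of the reward-model-plus-RL pipeline is the DPO loss. The block below restates it in the paper's notation as a quick-reference paraphrase, not a verbatim excerpt from the source.

```latex
% DPO objective as introduced in the cited paper (Rafailov et al., 2023).
% x: prompt; y_w / y_l: preferred / dispreferred completions from dataset D;
% pi_theta: policy being trained; pi_ref: frozen reference policy;
% beta: hyperparameter controlling deviation from the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because this is a classification-style loss over logged preference pairs, it can be optimized with standard supervised fine-tuning machinery; that is what SRC02-E01 means by eliminating the separate reward model and RL stage.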