R0040/2026-04-01/Q001/S02/R01
The original DPO paper by Rafailov et al.
## Summary
| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| URL | https://arxiv.org/abs/2305.18290 |
| Date accessed | 2026-04-01 |
| Publication date | 2023-05-29 |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Publication | NeurIPS 2023 |
## Selection Decision
Included in evidence base: Yes
Rationale: The original peer-reviewed paper introducing DPO, and the primary source for the most widely adopted alternative to RL-based RLHF fine-tuning.