R0040/2026-03-28/Q001/S02/R01
Original DPO paper from Stanford, published at NeurIPS 2023.
Summary
| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| URL | https://arxiv.org/abs/2305.18290 |
| Date accessed | 2026-03-28 |
| Publication date | 2023-05-29 (revised 2024-07-29) |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Publication | NeurIPS 2023 |
Selection Decision
Included in evidence base: Yes
Rationale: Primary source for DPO, the most widely adopted alternative to PPO-based RLHF. Provides both the theoretical derivation showing that DPO optimizes the same KL-constrained reward-maximization objective as RLHF, and empirical results showing performance that matches or exceeds PPO-based RLHF on sentiment control, summarization, and dialogue tasks.
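
For quick reference, the paper's central result is that the RLHF objective can be optimized directly with a simple classification loss over preference pairs (reproduced from the paper; $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ the KL penalty coefficient, and $(x, y_w, y_l)$ a prompt with preferred and dispreferred completions):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

A minimal PyTorch sketch of this loss, assuming per-sequence summed token log-probabilities have already been computed; the function name and argument names are illustrative, not taken from the authors' released code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is a 1-D tensor of summed token log-probabilities for
    the preferred (chosen) or dispreferred (rejected) completion under
    the trainable policy or the frozen reference model; `beta` is the
    KL penalty coefficient from the paper.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(m)) == softplus(-m), which is numerically stable.
    return F.softplus(-(chosen_margin - rejected_margin)).mean()
```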