Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Search S02
Result S02-R01

Original DPO paper from Stanford, published at NeurIPS 2023.

Summary

Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
URL: https://arxiv.org/abs/2305.18290
Date accessed: 2026-03-28
Publication date: 2023-05-29 (revised 2024-07-29)
Author(s): Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Publication: NeurIPS 2023

Selection Decision

Included in evidence base: Yes

Rationale: Primary source for DPO, the most widely adopted RLHF alternative. Provides both the theoretical derivation showing that DPO optimizes the same KL-constrained reward-maximization objective as RLHF, and empirical results showing performance equal to or better than PPO-based RLHF.
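The derivation the rationale points to reduces preference learning to a simple supervised loss over pairs of chosen/rejected responses. A minimal sketch of the per-pair DPO objective from the paper (function name and parameter names are illustrative; inputs are summed log-probabilities of each full response):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (Eq. 7 of Rafailov et al. 2023).

    Each argument is the log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    beta controls the strength of the implicit KL constraint.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin; minimized as the policy's
    # preference for the chosen response grows past the reference's.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; widening the margin in favor of the chosen response drives the loss toward zero, which is what makes the method a drop-in replacement for the reward-model-plus-PPO pipeline.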