R0040/2026-03-28/Q001/S02/R01
Original DPO paper from Stanford, published at NeurIPS 2023.
Summary
| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| URL | https://arxiv.org/abs/2305.18290 |
| Date accessed | 2026-03-28 |
| Publication date | 2023-05-29 (revised 2024-07-29) |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Publication | NeurIPS 2023 |
Selection Decision
Included in evidence base: Yes
Rationale: Primary source for DPO, the most widely adopted alternative to PPO-based RLHF. Provides both the theoretical derivation showing that DPO optimizes the same KL-constrained reward-maximization objective as RLHF, and empirical results showing performance that matches or exceeds PPO-based RLHF on sentiment control, summarization, and dialogue tasks.
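
For quick reference, the paper's central result is that the RLHF objective can be optimized directly with a simple classification loss over preference pairs (reproduced from the paper; $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ the KL penalty coefficient, and $(x, y_w, y_l)$ a prompt with preferred and dispreferred completions):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

A minimal PyTorch sketch of this loss, assuming per-sequence summed token log-probabilities have already been computed; the function name and argument names are illustrative, not taken from the authors' released code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is a 1-D tensor of summed token log-probabilities for
    the preferred (chosen) or dispreferred (rejected) completion under
    the trainable policy or the frozen reference model; `beta` is the
    KL penalty coefficient from the paper.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(m)) == softplus(-m), which is numerically stable.
    return F.softplus(-(chosen_margin - rejected_margin)).mean()
```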