R0040/2026-03-28/Q001/SRC02
Original DPO paper from Stanford, published at NeurIPS 2023.
Source
| Field | Value |
| --- | --- |
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | 2023-05-29 |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Research paper (peer-reviewed) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at a top venue (NeurIPS 2023). Authors from Stanford with strong ML credentials (Manning, Finn, Ermon). |
| Relevance | Directly defines the most widely adopted alternative to RLHF. |
| Bias flags | Selective reporting: the authors compare primarily against PPO-based RLHF and best-of-N sampling. Subsequent work has identified cases where DPO underperforms RLHF on certain tasks. |
Evidence

| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | DPO eliminates the separate reward model and RL training loop, and matches or exceeds RLHF performance on the tasks evaluated. |
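For context on SRC02-E01, the objective the paper optimizes in place of the reward-model-plus-RL pipeline is the DPO loss. The block below restates it in the paper's notation as a quick-reference paraphrase, not a verbatim excerpt from the source.

```latex
% DPO objective as introduced in the cited paper (Rafailov et al., 2023).
% x: prompt; y_w / y_l: preferred / dispreferred completions from dataset D;
% pi_theta: policy being trained; pi_ref: frozen reference policy;
% beta: hyperparameter controlling deviation from the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Because this is a classification-style loss over logged preference pairs, it can be optimized with standard supervised fine-tuning machinery; that is what SRC02-E01 means by eliminating the separate reward model and RL stage.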