S02-R01 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Summary

Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
URL: https://arxiv.org/abs/2305.18290
Date accessed: 2026-03-29
Publication date: May 2023 (revised July 2024)
Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Publication: NeurIPS 2023

Selection Decision

Selected as the primary paper introducing DPO: a seminal work with over 6,000 citations that established the most widely adopted alternative to RLHF.
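
For context on why DPO counts as an RLHF alternative: the paper shows that the reward implied by the Bradley-Terry preference model can be substituted directly into the KL-constrained RLHF objective, collapsing reward modeling and RL fine-tuning into a single maximum-likelihood loss over preference pairs. A sketch of that objective in the paper's notation (trainable policy \pi_\theta, frozen reference policy \pi_{\mathrm{ref}}, prompt x with preferred completion y_w and dispreferred completion y_l, KL-penalty strength \beta):

% DPO objective from the paper, reproduced here for reference.
% \sigma is the logistic function; \mathcal{D} is the preference dataset
% of triples (x, y_w, y_l).
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

Minimizing this is an ordinary supervised classification problem over preference pairs, which is why DPO needs neither a separately trained reward model nor PPO-style sampling during training.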