S02-R01 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Summary

Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
URL: https://arxiv.org/abs/2305.18290
Date accessed: 2026-03-29
Publication date: May 2023 (revised July 2024)
Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Publication: NeurIPS 2023

Selection Decision

Selected as the primary paper introducing DPO: a seminal work with over 6,000 citations that established the most widely adopted alternative to RLHF.
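
For context on why DPO counts as an RLHF alternative: the paper shows that the reward implied by the Bradley-Terry preference model can be substituted directly into the KL-constrained RLHF objective, collapsing reward modeling and RL fine-tuning into a single maximum-likelihood loss over preference pairs. A sketch of that objective in the paper's notation (trainable policy \pi_\theta, frozen reference policy \pi_{\mathrm{ref}}, prompt x with preferred completion y_w and dispreferred completion y_l, KL-penalty strength \beta):

% DPO objective from the paper, reproduced here for reference.
% \sigma is the logistic function; \mathcal{D} is the preference dataset
% of triples (x, y_w, y_l).
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

Minimizing this is an ordinary supervised classification problem over preference pairs, which is why DPO needs neither a separately trained reward model nor PPO-style sampling during training.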