R0040/2026-04-01/Q001/S02/R01

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Search S02
Result S02-R01

Original DPO paper by Rafailov et al.

Summary

| Field | Value |
|---|---|
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| URL | https://arxiv.org/abs/2305.18290 |
| Date accessed | 2026-04-01 |
| Publication date | 2023-05-29 |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Publication | NeurIPS 2023 |

Selection Decision

Included in evidence base: Yes

Rationale: Original peer-reviewed paper introducing DPO. Primary source for the most widely adopted RLHF alternative.
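For reference, the paper's core contribution is a preference loss computed directly on policy and reference-model log-probabilities, with no separate reward model or RL loop. A minimal sketch of that per-example objective (the function name and scalar log-prob inputs are illustrative, not from the paper's codebase):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs.

    L = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                             - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)) = log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference (both log-ratios zero), the loss is log 2; it shrinks as the policy assigns relatively more probability to the preferred response.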