
Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Search S02
Result S02-R01
Source SRC02

Rafailov et al. -- Direct Preference Optimization (NeurIPS 2023)

Source

| Field | Value |
| --- | --- |
| Title | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| Publisher | NeurIPS 2023 |
| Author(s) | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn |
| Date | 2023-05-29 |
| URL | https://arxiv.org/abs/2305.18290 |
| Type | Research paper (peer-reviewed) |

Summary

| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Low risk |

Rationale

| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at NeurIPS 2023, a top-tier ML venue; the authors are at Stanford, and the paper is widely cited. |
| Relevance | Directly introduces DPO, the most widely adopted alternative to RLHF. |
| Bias flags | The authors have a stake in DPO's success, but the paper underwent rigorous peer review, and its benchmarks are standard and reproducible. |

Evidence Extracts

| Evidence ID | Summary |
| --- | --- |
| SRC02-E01 | DPO eliminates the separate reward model and RL loop, achieving 40-75% compute savings |
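To ground the extract's claim that DPO removes the reward model and RL loop, here is a minimal sketch of the DPO objective (Eq. 7 of the paper) in PyTorch. The function name, argument names, and the dummy values in the example call are illustrative assumptions, not the authors' reference implementation; only the loss formula itself comes from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective (Eq. 7 of Rafailov et al., 2023).

    Each argument holds per-sequence log-probabilities (summed over
    tokens) of the chosen/rejected completion under the policy being
    trained and under a frozen reference model.
    """
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref).
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin of each preference pair:
    # no learned reward model, no sampling, no PPO loop.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Illustrative call with dummy log-probs for a batch of two pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -9.8]))
```

Because this is an ordinary supervised loss over a static preference dataset, training needs no rollouts or reward-model inference, which is the source of the compute savings the extract reports.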