
Research R0040 — RLHF Alternatives
Run: 2026-03-28
Query: Q001
Search: S02
Result: S02-R01
Source: SRC02

The original Direct Preference Optimization (DPO) paper from Stanford, published at NeurIPS 2023.

Source

Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Publisher: NeurIPS 2023
Author(s): Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Date: 2023-05-29
URL: https://arxiv.org/abs/2305.18290
Type: Research paper (peer-reviewed)

Summary

Reliability: High
Relevance: High
Bias (missing data): Low risk
Bias (measurement): Low risk
Bias (selective reporting): Some concerns
Bias (randomization): N/A
Bias (protocol deviation): N/A
Bias (COI/funding): Low risk

Rationale

Reliability: Peer-reviewed at a top venue (NeurIPS 2023). The authors are from Stanford with strong ML credentials (Manning, Finn, Ermon).
Relevance: Introduces and directly defines DPO, the most widely adopted RLHF alternative.
Bias flags: Selective reporting. The authors compare primarily against PPO-based RLHF and best-of-N sampling; subsequent work has identified cases where DPO underperforms RLHF on certain tasks.

Evidence Extracts

SRC02-E01: DPO eliminates the explicit reward model and the RL training loop, while matching or exceeding RLHF performance.
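To make extract SRC02-E01 concrete, below is a minimal sketch of the DPO objective (Eq. 7 of the paper) as a supervised loss on preference pairs. The function name, tensor names, and the beta value are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective (Eq. 7 of the paper).

    Each input is a batch of sequence log-probabilities log pi(y|x) summed
    over response tokens: 'chosen' is the preferred response y_w, 'rejected'
    is the dispreferred response y_l. beta scales the implicit KL constraint
    to the frozen reference policy.
    """
    # Log-ratios of the trained policy against the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the preference margin: no reward model is fit and no
    # RL rollouts are sampled, only forward passes through both models.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In the paper, the sequence log-probabilities come from teacher-forced forward passes of the policy and a frozen copy of the SFT model over the same pairwise preference data that RLHF would use to fit a reward model.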