# SRC02-E01 — DPO Eliminates the Reward Model
## Extract
DPO "enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss." It is "stable, performant, and computationally lightweight" with "no sampling from the language model during fine-tuning required" and "minimal hyperparameter tuning needed."
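The "simple classification loss" in the quote is the DPO objective: a binary cross-entropy on the margin between β-scaled log-probability ratios of the chosen and rejected completions, with no reward model or sampling in the loop. A minimal sketch in plain Python, assuming per-sequence log-probabilities have already been summed over tokens; the function name and the β default are illustrative, not from the source:

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    The policy's implicit reward for a completion is the beta-scaled
    log-ratio against the frozen reference model; the loss is then
    -log sigmoid(reward_chosen - reward_rejected), i.e. a logistic
    classification loss on the preference pair.
    """
    # Log-ratios of policy vs. reference for each completion.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Margin between implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin): small when the policy already prefers
    # the chosen completion, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization, when the policy equals the reference, every log-ratio is zero and the loss is log 2; increasing the policy's relative probability on the chosen completion lowers it, which is what makes gradient descent on preference pairs sufficient.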
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — DPO is a concrete, widely adopted alternative to reward-model-based RLHF | Strong |
| H2 | Contradicts — DPO is not merely theoretical; it is used in production systems | Strong |
| H3 | Supports — DPO targets computational complexity, not all RLHF failure modes | Moderate |
## Context
DPO is arguably the single most impactful RLHF alternative published to date, spawning an entire family of preference optimization methods (IPO, KTO, ORPO, SimPO).
## Notes
DPO still trains on human preference data; it replaces the optimization algorithm, not the feedback source. Any biases in the preference data, such as sycophancy, can therefore carry over into the resulting policy.