# SRC02-E01 — DPO Eliminates the Reward Model
## Extract
DPO "enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss." It is "stable, performant, and computationally lightweight" with "no sampling from the language model during fine-tuning required" and "minimal hyperparameter tuning needed."
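The "simple classification loss" in the quote is the DPO objective: a binary cross-entropy on the margin between β-scaled log-probability ratios of the chosen and rejected completions, with no reward model or sampling in the loop. A minimal sketch in plain Python, assuming per-sequence log-probabilities have already been summed over tokens; the function name and the β default are illustrative, not from the source:

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    The policy's implicit reward for a completion is the beta-scaled
    log-ratio against the frozen reference model; the loss is then
    -log sigmoid(reward_chosen - reward_rejected), i.e. a logistic
    classification loss on the preference pair.
    """
    # Log-ratios of policy vs. reference for each completion.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Margin between implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin): small when the policy already prefers
    # the chosen completion, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization, when the policy equals the reference, every log-ratio is zero and the loss is log 2; increasing the policy's relative probability on the chosen completion lowers it, which is what makes gradient descent on preference pairs sufficient.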
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — DPO is a concrete, widely adopted alternative to reward-model-based RLHF | Strong |
| H2 | Contradicts — DPO is not merely theoretical; it is used in production systems | Strong |
| H3 | Supports — DPO targets computational complexity, not all RLHF failure modes | Moderate |
## Context
DPO is arguably the single most impactful RLHF alternative published to date, spawning an entire family of preference optimization methods (IPO, KTO, ORPO, SimPO).
## Notes
DPO still trains on human preference data; it replaces the optimization algorithm, not the feedback source. Any biases in the preference data, such as sycophancy, can therefore carry over into the resulting policy.