# S02 — DPO vs RLHF Comparison

## Summary
| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "DPO direct preference optimization vs RLHF results comparison 2025" |
| Filters | None |
| Results returned | 10 |
| Results selected | 3 |
| Results rejected | 7 |
## Selected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | Direct Preference Optimization (arXiv) | https://arxiv.org/abs/2305.18290 | Primary DPO paper |
| S02-R02 | On the Limited Generalization Capability of DPO (Apple) | https://machinelearning.apple.com/research/reward-generalization | Important counterpoint on DPO limitations |
| S02-R03 | RLHF without RL (ICLR Blogposts 2024) | https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/ | Academic analysis of DPO approach |
## Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R04 | DPO: a lightweight counterpart to RLHF (Toloka) | https://toloka.ai/blog/direct-preference-optimization/ | Commercial blog, covered by primary paper |
| S02-R05 | Simplifying Alignment (HuggingFace) | https://huggingface.co/blog/ariG23498/rlhf-to-dpo | Tutorial content, not novel findings |
| S02-R06 | Why Human Preference Optimization Still Matters | https://www.digitaldividedata.com/blog/why-human-preference-optimization-rlhf-dpo-still-matters | Commercial perspective, limited data |
| S02-R07 | DPO paper (OpenReview) | https://openreview.net/forum?id=HPuSIXJaa9 | Duplicate of primary paper venue |
| S02-R08 | DPO Technical Deep Dive (Together AI) | https://www.together.ai/blog/direct-preference-optimization | Commercial tutorial, covered by primary paper |
| S02-R09 | DPO paper PDF | https://arxiv.org/pdf/2305.18290 | Duplicate format of selected R01 |
| S02-R10 | RRG-DPO (MICCAI 2025) | https://papers.miccai.org/miccai-2025/paper/1273_paper.pdf | Domain-specific application, not general comparison |
## Notes
The Apple (2025) finding on DPO's out-of-distribution limitations (S02-R02) was the key discriminating result: it provides an empirical counterpoint to the claims of the primary DPO paper (S02-R01).
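For orientation on what the selected sources compare: the DPO objective from the primary paper (S02-R01) replaces RLHF's reward-model-plus-RL pipeline with a single classification-style loss over preference pairs. The sketch below is a minimal, dependency-free illustration of that per-example loss, assuming summed sequence log-probabilities are already available; it is not the paper's reference implementation, and the variable names are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023, arXiv:2305.18290).

    logp_w / logp_l: summed token log-probs of the chosen (w) and
    rejected (l) responses under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta scales the implicit reward margin.
    """
    # Implicit reward margin: beta * difference of policy-vs-reference
    # log-ratios between the chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), written stably as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

A larger margin (policy preferring the chosen response more strongly than the reference does) drives the loss toward zero, which is the mechanism the DPO-vs-RLHF comparison in the selected sources hinges on.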