S02 — DPO vs RLHF Comparison

Summary

| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "DPO direct preference optimization vs RLHF results comparison 2025" |
| Filters | None |
| Results returned | 10 |
| Results selected | 3 |
| Results rejected | 7 |

Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | Direct Preference Optimization (arXiv) | https://arxiv.org/abs/2305.18290 | Primary DPO paper |
| S02-R02 | On the Limited Generalization Capability of DPO (Apple) | https://machinelearning.apple.com/research/reward-generalization | Important counterpoint on DPO limitations |
| S02-R03 | RLHF without RL (ICLR Blogposts 2024) | https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/ | Academic analysis of DPO approach |

Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R04 | DPO: a lightweight counterpart to RLHF (Toloka) | https://toloka.ai/blog/direct-preference-optimization/ | Commercial blog, covered by primary paper |
| S02-R05 | Simplifying Alignment (HuggingFace) | https://huggingface.co/blog/ariG23498/rlhf-to-dpo | Tutorial content, not novel findings |
| S02-R06 | Why Human Preference Optimization Still Matters | https://www.digitaldividedata.com/blog/why-human-preference-optimization-rlhf-dpo-still-matters | Commercial perspective, limited data |
| S02-R07 | DPO paper (OpenReview) | https://openreview.net/forum?id=HPuSIXJaa9 | Duplicate of primary paper venue |
| S02-R08 | DPO Technical Deep Dive (Together AI) | https://www.together.ai/blog/direct-preference-optimization | Commercial tutorial, covered by primary paper |
| S02-R09 | DPO paper PDF | https://arxiv.org/pdf/2305.18290 | Duplicate format of selected S02-R01 |
| S02-R10 | RRG-DPO (MICCAI 2025) | https://papers.miccai.org/miccai-2025/paper/1273_paper.pdf | Domain-specific application, not general comparison |

Notes

The Apple (2025) finding on DPO's limited out-of-distribution generalization (S02-R02) was an important discriminating result in the selection.
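
For reference, the core objective from the primary paper (S02-R01) that these results compare against RLHF is the DPO loss: a negative log-sigmoid over the difference in policy-vs-reference log-ratios for the preferred and dispreferred responses. A minimal sketch for a single preference pair, assuming summed per-response log-probabilities are already available (function and argument names here are illustrative, not from any library):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (Rafailov et al., 2023).

    Each argument is the summed log-probability of the chosen or
    rejected response under the trainable policy or the frozen
    reference model; beta scales the implicit reward.
    """
    # Implicit reward margin: beta * difference of log-ratios
    # between the chosen and rejected responses.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log sigmoid(logits): minimized as the policy prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

No reward model or RL rollout appears anywhere in the computation, which is the "RLHF without RL" point made in S02-R03; the S02-R02 counterpoint is that this supervised objective can generalize worse out of distribution than a learned reward model.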