# S02 — DPO vs RLHF Comparison

## Summary
| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "DPO direct preference optimization vs RLHF results comparison 2025" |
| Filters | None |
| Results returned | 10 |
| Results selected | 3 |
| Results rejected | 7 |
## Selected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | Direct Preference Optimization (arXiv) | https://arxiv.org/abs/2305.18290 | Primary DPO paper |
| S02-R02 | On the Limited Generalization Capability of DPO (Apple) | https://machinelearning.apple.com/research/reward-generalization | Important counterpoint on DPO limitations |
| S02-R03 | RLHF without RL (ICLR Blogposts 2024) | https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/ | Academic analysis of DPO approach |
## Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R04 | DPO: a lightweight counterpart to RLHF (Toloka) | https://toloka.ai/blog/direct-preference-optimization/ | Commercial blog, covered by primary paper |
| S02-R05 | Simplifying Alignment (HuggingFace) | https://huggingface.co/blog/ariG23498/rlhf-to-dpo | Tutorial content, not novel findings |
| S02-R06 | Why Human Preference Optimization Still Matters | https://www.digitaldividedata.com/blog/why-human-preference-optimization-rlhf-dpo-still-matters | Commercial perspective, limited data |
| S02-R07 | DPO paper (OpenReview) | https://openreview.net/forum?id=HPuSIXJaa9 | Duplicate of primary paper venue |
| S02-R08 | DPO Technical Deep Dive (Together AI) | https://www.together.ai/blog/direct-preference-optimization | Commercial tutorial, covered by primary paper |
| S02-R09 | DPO paper PDF | https://arxiv.org/pdf/2305.18290 | Duplicate format of selected R01 |
| S02-R10 | RRG-DPO (MICCAI 2025) | https://papers.miccai.org/miccai-2025/paper/1273_paper.pdf | Domain-specific application, not general comparison |
## Notes
The Apple (2025) finding on DPO's out-of-distribution limitations (S02-R02) was the key discriminating result: it provides an empirical counterpoint to the claims of the primary DPO paper (S02-R01).
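For orientation on what the selected sources compare: the DPO objective from the primary paper (S02-R01) replaces RLHF's reward-model-plus-RL pipeline with a single classification-style loss over preference pairs. The sketch below is a minimal, dependency-free illustration of that per-example loss, assuming summed sequence log-probabilities are already available; it is not the paper's reference implementation, and the variable names are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023, arXiv:2305.18290).

    logp_w / logp_l: summed token log-probs of the chosen (w) and
    rejected (l) responses under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta scales the implicit reward margin.
    """
    # Implicit reward margin: beta * difference of policy-vs-reference
    # log-ratios between the chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), written stably as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

A larger margin (policy preferring the chosen response more strongly than the reference does) drives the loss toward zero, which is the mechanism the DPO-vs-RLHF comparison in the selected sources hinges on.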