Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q001 — RLHF Alternatives
Search S02
Result S02-R02

S02-R02 — On the Limited Generalization Capability of DPO

Summary

Title: On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
URL: https://machinelearning.apple.com/research/reward-generalization
Date accessed: 2026-03-29
Publication date: 2025
Authors: Apple Machine Learning Research
Publication: Apple ML Research

Selection Decision

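Selected as an important counterpoint to favorable claims about DPO. The paper shows that the implicit reward model induced by DPO generalizes less well out of distribution than an explicitly trained reward model, and it quantifies the resulting drops in preference accuracy.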
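The sketch below makes "implicit reward model" and "accuracy drop" concrete. It assumes the standard DPO formulation, where the implicit reward of a response is beta times the log-ratio of its likelihood under the trained policy versus the reference model. The prompt-dependent normalizer cancels in pairwise comparisons. The dataclass, function names, beta = 0.1, and every log-probability value are hypothetical toy inputs, not code or data from the paper. A real evaluation would compute summed sequence log-probabilities with an actual policy and reference model.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Hypothetical record: summed token log-probabilities of each response
    # under the DPO-trained policy and the frozen reference model.
    policy_logp_chosen: float
    ref_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_rejected: float


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)), omitting the
    prompt-only beta * log Z(x) term, which cancels when comparing two responses."""
    return beta * (policy_logp - ref_logp)


def preference_accuracy(pairs: list[PreferencePair], beta: float = 0.1) -> float:
    """Fraction of pairs where the implicit reward ranks the chosen response higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        implicit_reward(p.policy_logp_chosen, p.ref_logp_chosen, beta)
        > implicit_reward(p.policy_logp_rejected, p.ref_logp_rejected, beta)
        for p in pairs
    )
    return correct / len(pairs)


def accuracy_drop(in_dist: list[PreferencePair], out_of_dist: list[PreferencePair],
                  beta: float = 0.1) -> float:
    """In-distribution accuracy minus out-of-distribution accuracy."""
    return preference_accuracy(in_dist, beta) - preference_accuracy(out_of_dist, beta)


# Toy, made-up log-probabilities purely for illustration.
in_dist_pairs = [
    PreferencePair(-10.0, -12.0, -15.0, -13.0),  # chosen +0.2 vs rejected -0.2: correct
    PreferencePair(-8.0, -9.0, -9.0, -9.0),      # chosen +0.1 vs rejected 0.0: correct
]
ood_pairs = [
    PreferencePair(-20.0, -19.0, -18.0, -19.0),  # chosen -0.1 vs rejected +0.1: wrong
    PreferencePair(-11.0, -12.0, -14.0, -13.0),  # chosen +0.1 vs rejected -0.1: correct
]

print(preference_accuracy(in_dist_pairs), preference_accuracy(ood_pairs),
      accuracy_drop(in_dist_pairs, ood_pairs))  # → 1.0 0.5 0.5
```

Because beta is positive and only rescales both rewards, the ranking, and therefore the accuracy, does not depend on its value. The paper's comparison against an explicit reward model would apply the same accuracy metric to that model's scalar scores.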