R0040/2026-04-01/Q001/S02/R02

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Search S02
Result S02-R02

Apple ML Research paper on the limited out-of-distribution generalization of the implicit reward model induced by DPO.

Summary

Title: On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
URL: https://machinelearning.apple.com/research/reward-generalization
Date accessed: 2026-04-01
Publication date: 2025 (estimated)
Author(s): Apple Machine Learning Research
Publication: Apple ML Research

Selection Decision

Included in evidence base: Yes

Rationale: Important counterpoint to DPO's claimed equivalence to RLHF. Reports that the implicit reward model induced by DPO loses 3-7% accuracy in out-of-distribution settings relative to an explicitly trained reward model.