R0040/2026-04-01/Q001/S02/R02
Apple ML Research on DPO's limited out-of-distribution generalization.
Summary
| Field | Value |
|---|---|
| Title | On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization |
| URL | https://machinelearning.apple.com/research/reward-generalization |
| Date accessed | 2026-04-01 |
| Publication date | 2025 (estimated) |
| Author(s) | Apple Machine Learning Research |
| Publication | Apple ML Research |
Selection Decision
Included in evidence base: Yes
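Rationale: Important counterpoint to the claimed equivalence between DPO and RLHF. The paper reports a 3-7% accuracy drop for DPO's implicit reward model, relative to an explicitly trained reward model, in out-of-distribution settings.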