S05 — KTO, ORPO, SimPO, and Other DPO Variants

Summary

Source / Database: Web (Google via WebSearch) + arXiv
Query terms: "KTO kahneman tversky optimization RLHF alternative 2024 2025"; "ORPO odds ratio preference optimization SimPO simple preference optimization 2024 2025"; "SPIN self-play fine-tuning IPO identity preference optimization RLHF alternatives"
Filters: None
Results returned: 30 (10 per query)
Results selected: 5
Results rejected: 25

Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R01 | KTO: Model Alignment as Prospect Theoretic Optimization (arXiv) | https://arxiv.org/abs/2402.01306 | Primary KTO paper |
| S05-R02 | ORPO: Monolithic Preference Optimization without Reference Model (arXiv) | https://arxiv.org/abs/2403.07691 | Primary ORPO paper |
| S05-R03 | RLHF and alternatives: Overview (Argilla) | https://argilla.io/blog/mantisnlp-rlhf-part-9/ | Comprehensive overview of all variants |
| S05-R04 | DPO Isn't Enough: The Modern Post-Training Stack (Medium) | https://medium.com/@fahey_james/dpo-isnt-enough-the-modern-post-training-stack-simpo-orpo-kto-and-beyond-d82e52a1ee6c | Current stack analysis |
| S05-R05 | Self-Play Preference Optimization (SPPO) | https://uclaml.github.io/SPPO/ | Self-play approach |

Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S05-R06 – S05-R30 | Various secondary sources | Various | Tutorials, duplicate coverage, or narrow application papers |

Notes

This search covered the "long tail" of RLHF alternatives that emerged from the DPO lineage. Each variant addresses a specific limitation of vanilla DPO: KTO relaxes the data requirement from paired preferences to per-example desirable/undesirable labels; ORPO removes the dependence on a frozen reference model by folding a preference term into SFT; IPO counters DPO's tendency to overfit the preference data; SimPO simplifies the objective with a reference-free, length-normalized implicit reward; SPIN replaces human preference pairs with self-play against earlier model checkpoints.
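
To make the "reference model dependence" and "simplicity" points concrete, here is a minimal sketch contrasting the per-pair DPO and SimPO losses on scalar sequence log-probabilities. The function names and hyperparameter values are illustrative, not taken from the cited papers' code; the formulas follow the published objectives (DPO scores policy-vs-reference log-ratios, SimPO scores length-normalized policy log-probs minus a target margin).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO pairwise loss: requires log-probs from a frozen reference model.

    loss = -log sigma(beta * [(logpi_w - logref_w) - (logpi_l - logref_l)])
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def simpo_loss(logp_w: float, logp_l: float,
               len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO pairwise loss: reference-free. The implicit reward is the
    length-normalized average log-probability, and gamma is a target
    reward margin between chosen and rejected responses.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(sigmoid(margin))

# A wider gap between chosen and rejected responses lowers the loss.
strong = simpo_loss(-10.0, -40.0, 20, 20)  # chosen strongly preferred
weak = simpo_loss(-20.0, -40.0, 20, 20)    # weaker preference -> higher loss
print(strong, weak)
```

Note that `simpo_loss` takes no reference log-probs at all, which is exactly the simplification SimPO claims; ORPO achieves reference-freeness differently, via an odds-ratio penalty added to the SFT loss.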