R0040/2026-03-28/Q001/SRC07/E01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Source SRC07
Evidence SRC07-E01
Type Factual

ORPO combines instruction tuning and preference alignment in a single phase.

URL: https://arxiv.org/abs/2403.07691

Extract

ORPO (Odds Ratio Preference Optimization) is a reference model-free monolithic preference optimization algorithm that eliminates the necessity for an additional preference alignment phase. Key characteristics:

  1. Uses log odds ratios to directly contrast favored and disfavored responses during SFT (supervised fine-tuning)
  2. Combines instruction tuning and preference alignment in a single process
  3. Reference model-free — eliminates the need to maintain a frozen copy of the base model
  4. Computationally more efficient than both RLHF and DPO

ORPO represents the furthest simplification of the alignment pipeline: RLHF requires SFT, reward model training, and RL optimization (three stages); DPO requires SFT followed by preference optimization (two stages); ORPO achieves alignment in a single stage.
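To make the single-stage objective concrete, the sketch below shows how the odds-ratio term from the paper can be combined with the ordinary SFT loss. It assumes the caller supplies length-averaged log-probabilities of the favored and disfavored responses under the policy being trained; the function name, signature, and the default weight `lam` are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    """Sketch of the ORPO objective (arXiv:2403.07691).

    chosen_logps / rejected_logps: length-averaged log-probabilities of the
        favored / disfavored responses under the policy being trained.
    chosen_nll: standard SFT negative log-likelihood on the favored response.
    lam: weight on the odds-ratio term (hyperparameter; value here is illustrative).
    """
    # odds(y|x) = p(y|x) / (1 - p(y|x)); compute log odds in log space
    # for numerical stability: log p - log(1 - p).
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio term: contrast favored vs. disfavored responses directly,
    # with no reward model and no frozen reference model.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Single-stage objective: SFT loss plus the weighted odds-ratio penalty.
    return (chosen_nll + lam * or_loss).mean()
```

The weight `lam` controls how strongly preference contrast shapes training relative to plain instruction tuning; setting it to zero recovers ordinary SFT, which is what makes the method a single combined phase rather than two.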

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1         | Supports     | ORPO is a structurally distinct alternative that eliminates multiple RLHF components
H2         | Contradicts  | Demonstrates that alignment can be achieved with far less infrastructure than RLHF
H3         | Contradicts  | ORPO's elimination of both the reference model and the separate alignment phase is a structural departure, not merely a modification

Context

ORPO has received less production adoption than DPO or GRPO, but its theoretical contribution — demonstrating that alignment does not require a separate phase — has influenced subsequent work. Its single-stage approach is conceptually the simplest of all RLHF alternatives.