H1 — Multiple Viable Alternatives to RLHF Exist and Are in Active Use

Statement

The AI research community has developed multiple concrete alternatives to RLHF. These methods are not merely theoretical: they are in active production use at major AI laboratories, reflecting a broad shift away from traditional RLHF.

Status

Supported. Evidence from 8 sources consistently demonstrates that at least 6 distinct families of RLHF alternatives (DPO, RLAIF/CAI, GRPO, KTO, RLVR, ORPO/SimPO) are in active use, with several deployed in production by major AI companies.

Supporting Evidence

SRC02-E01: DPO eliminates the reward model, recasting RLHF as a classification problem (sketch below)
SRC02-E02: DPO matches or exceeds RLHF on multiple benchmarks
SRC03-E01: Constitutional AI replaces human feedback with principle-based AI self-critique (sketch below)
SRC04-E01: RLAIF matches RLHF at 100x lower cost
SRC05-E01: Systematic catalogue of RLHF problems motivating alternatives
SRC06-E01: GRPO halves compute requirements and is dominant for open LLMs (sketch below)
SRC07-E01: KTO uses binary signals, matching preference-based methods at scale (sketch below)
SRC08-E01: Industry analysis confirms a broad shift toward reward optimization
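
To make the classification framing of SRC02-E01 concrete, here is a minimal sketch of the DPO objective in PyTorch. The function name and the choice of beta are illustrative rather than drawn from any source; the inputs are summed token log-probabilities that a surrounding training loop would supply.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved from the frozen
    # reference model on each completion, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification over preference pairs: maximize the
    # probability that the chosen completion outscores the rejected one.
    # No separately trained reward model is involved.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```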
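
The self-critique mechanism behind SRC03-E01 can be sketched as a critique-and-revise loop. Here `llm` is a hypothetical text-in, text-out callable and the prompt wording is invented for illustration; in the published Constitutional AI pipeline, the revised outputs are then used as fine-tuning data.

```python
def constitutional_revision(llm, prompt, principles):
    """One round of Constitutional-AI-style self-critique: the model
    critiques its own draft against each written principle and rewrites
    it, with no human preference labels in the loop."""
    draft = llm(prompt)
    for principle in principles:
        critique = llm(
            f"Principle: {principle}\n"
            f"Response: {draft}\n"
            "Identify any way the response violates the principle."
        )
        draft = llm(
            f"Response: {draft}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft  # revised drafts become supervised fine-tuning data
```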
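
The compute saving cited in SRC06-E01 comes from replacing the learned value network (critic) with a group-relative baseline. A minimal sketch, assuming one scalar reward per sampled completion; the function and parameter names are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (num_prompts, group_size), one scalar per completion
    # sampled for the same prompt. Each completion is scored relative to
    # its own group, so no critic network has to be trained or stored;
    # dropping the critic is the source of the savings versus PPO-style RLHF.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```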
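
The binary signal in SRC07-E01 is visible in KTO's loss: each completion carries only a thumbs-up or thumbs-down label, with no paired alternative. A simplified sketch of the Kahneman-Tversky-style objective; the names, default weights, and the handling of the KL reference point `kl_ref` are illustrative (in practice it is estimated per batch).

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, kl_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    # Log-ratio "reward" of the policy versus the frozen reference model.
    r = policy_logps - ref_logps
    # Desirable examples are pushed above the KL reference point and
    # undesirable ones below it: a per-example binary signal, so no
    # preference pairs need to be collected.
    desirable = lambda_d * (1 - torch.sigmoid(beta * (r - kl_ref)))
    undesirable = lambda_u * (1 - torch.sigmoid(beta * (kl_ref - r)))
    # is_desirable: boolean tensor selecting the branch per example.
    return torch.where(is_desirable, desirable, undesirable).mean()
```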

Contradicting Evidence

SRC02-E02: DPO underperforms RLHF on out-of-distribution data (Apple, 2025)

Reasoning

The evidence is overwhelming that multiple alternatives exist and are in production use. The Apple (2025) finding on DPO's out-of-distribution limitations rules out any claim that RLHF is fully obsolete, but it does not undermine the core finding that alternatives are viable and widely adopted.

Relationship to Other Hypotheses

H1 is the affirmative hypothesis. H2 (the negative hypothesis) is effectively eliminated. H3 (the nuanced hypothesis) adds important context: RLHF and its alternatives continue to coexist rather than one fully displacing the other.