H1 — Multiple Viable Alternatives to RLHF Exist and Are in Active Use

Statement

The AI research community has developed multiple concrete alternatives to RLHF. These methods are not merely theoretical: they are in active production use at major AI laboratories, reflecting a broad shift away from traditional RLHF.

Status

Supported. Evidence from 8 sources consistently demonstrates that at least 6 distinct families of RLHF alternatives (DPO, RLAIF/CAI, GRPO, KTO, RLVR, ORPO/SimPO) are in active use, with several deployed in production by major AI companies.

Supporting Evidence

SRC02-E01: DPO eliminates the reward model, recasting RLHF as a classification problem (sketch below)
SRC02-E02: DPO matches or exceeds RLHF on multiple benchmarks
SRC03-E01: Constitutional AI replaces human feedback with principle-based AI self-critique (sketch below)
SRC04-E01: RLAIF matches RLHF at 100x lower cost
SRC05-E01: Systematic catalogue of RLHF problems motivating alternatives
SRC06-E01: GRPO halves compute requirements and is dominant for open LLMs (sketch below)
SRC07-E01: KTO uses binary signals, matching preference-based methods at scale (sketch below)
SRC08-E01: Industry analysis confirms a broad shift toward reward optimization
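
To make the classification framing of SRC02-E01 concrete, here is a minimal sketch of the DPO objective in PyTorch. The function name and the choice of beta are illustrative rather than drawn from any source; the inputs are summed token log-probabilities that a surrounding training loop would supply.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved from the frozen
    # reference model on each completion, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification over preference pairs: maximize the
    # probability that the chosen completion outscores the rejected one.
    # No separately trained reward model is involved.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```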
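
The self-critique mechanism behind SRC03-E01 can be sketched as a critique-and-revise loop. Here `llm` is a hypothetical text-in, text-out callable and the prompt wording is invented for illustration; in the published Constitutional AI pipeline, the revised outputs are then used as fine-tuning data.

```python
def constitutional_revision(llm, prompt, principles):
    """One round of Constitutional-AI-style self-critique: the model
    critiques its own draft against each written principle and rewrites
    it, with no human preference labels in the loop."""
    draft = llm(prompt)
    for principle in principles:
        critique = llm(
            f"Principle: {principle}\n"
            f"Response: {draft}\n"
            "Identify any way the response violates the principle."
        )
        draft = llm(
            f"Response: {draft}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft  # revised drafts become supervised fine-tuning data
```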
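
The compute saving cited in SRC06-E01 comes from replacing the learned value network (critic) with a group-relative baseline. A minimal sketch, assuming one scalar reward per sampled completion; the function and parameter names are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (num_prompts, group_size), one scalar per completion
    # sampled for the same prompt. Each completion is scored relative to
    # its own group, so no critic network has to be trained or stored;
    # dropping the critic is the source of the savings versus PPO-style RLHF.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```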
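
The binary signal in SRC07-E01 is visible in KTO's loss: each completion carries only a thumbs-up or thumbs-down label, with no paired alternative. A simplified sketch of the Kahneman-Tversky-style objective; the names, default weights, and the handling of the KL reference point `kl_ref` are illustrative (in practice it is estimated per batch).

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, kl_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    # Log-ratio "reward" of the policy versus the frozen reference model.
    r = policy_logps - ref_logps
    # Desirable examples are pushed above the KL reference point and
    # undesirable ones below it: a per-example binary signal, so no
    # preference pairs need to be collected.
    desirable = lambda_d * (1 - torch.sigmoid(beta * (r - kl_ref)))
    undesirable = lambda_u * (1 - torch.sigmoid(beta * (kl_ref - r)))
    # is_desirable: boolean tensor selecting the branch per example.
    return torch.where(is_desirable, desirable, undesirable).mean()
```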

Contradicting Evidence

SRC02-E02: DPO underperforms RLHF on out-of-distribution data (Apple, 2025)

Reasoning

The evidence is overwhelming that multiple alternatives exist and are in production use. The Apple (2025) finding on DPO's out-of-distribution limitations rules out any claim that RLHF is fully obsolete, but it does not undermine the core finding that alternatives are viable and widely adopted.

Relationship to Other Hypotheses

H1 is the affirmative hypothesis. H2 (the negative hypothesis) is effectively eliminated. H3 (the nuanced hypothesis) adds important context: RLHF and its alternatives continue to coexist rather than one fully displacing the other.