# H1: Multiple Viable Alternatives to RLHF Exist and Are in Active Use
## Statement
The AI research community has developed multiple concrete alternatives to RLHF that are not merely theoretical but are in active production use across major AI laboratories, representing a broad shift away from traditional RLHF.
## Status
Supported. Evidence from 8 sources consistently demonstrates that at least 6 distinct families of RLHF alternatives (DPO, RLAIF/CAI, GRPO, KTO, RLVR, ORPO/SimPO) are in active use, with several deployed in production by major AI companies.
## Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO eliminates the separate reward model, recasting RLHF's preference objective as a classification problem (first sketch after this table) |
| SRC02-E02 | DPO matches or exceeds RLHF on multiple benchmarks |
| SRC03-E01 | Constitutional AI replaces human feedback with principle-based AI self-critique |
| SRC04-E01 | RLAIF matches RLHF at 100x lower cost |
| SRC05-E01 | Systematic catalogue of RLHF problems motivating alternatives |
| SRC06-E01 | GRPO halves compute requirements relative to PPO-based RLHF by dropping the learned critic, and is dominant for open LLMs (second sketch after this table) |
| SRC07-E01 | KTO learns from unpaired binary (desirable/undesirable) signals, matching preference-pair methods at scale (third sketch after this table) |
| SRC08-E01 | Industry analysis confirms broad shift toward reward optimization |
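
For concreteness, here is a minimal PyTorch sketch of the DPO objective summarized in SRC02-E01. The function name, argument layout, and `beta` default are illustrative assumptions, not details taken from the source.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) completion pairs.

    Each argument is a tensor of summed per-token log-probabilities under
    the policy being trained or a frozen reference model; beta scales the
    implicit KL penalty.
    """
    # Implicit reward of each completion: scaled log-ratio of the
    # trained policy to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification: push the chosen completion's implicit
    # reward above the rejected one's. No reward model, no RL loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because this reduces to a logistic loss over log-probability ratios, DPO runs on ordinary supervised fine-tuning infrastructure, which is much of its practical appeal.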
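
GRPO's compute savings (SRC06-E01) come from replacing PPO's learned value model with a group-relative baseline: several completions are sampled per prompt and each is scored against its own group. A sketch of that advantage computation, assuming a `(num_prompts, group_size)` reward tensor:

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for GRPO.

    rewards: tensor of shape (num_prompts, group_size), one reward per
    sampled completion. Normalizing within each prompt's group removes
    the need for a learned critic, which is where the savings come from.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```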
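
KTO's distinguishing feature (SRC07-E01) is that it needs only unpaired thumbs-up/thumbs-down labels rather than preference pairs. The sketch below is simplified: it uses a clamped batch-mean log-ratio as the reference point, whereas the paper's KL estimator is more involved.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO loss over unpaired, binary-labeled completions.

    policy_logps / ref_logps: summed log-probs of each completion under
    the trained policy and a frozen reference model.
    is_desirable: boolean tensor marking each completion good or bad.
    """
    log_ratio = policy_logps - ref_logps
    # Reference point: a (simplified) batch-level KL estimate, clamped
    # at zero and detached so it acts as a fixed baseline.
    z_ref = log_ratio.detach().mean().clamp(min=0)
    # Kahneman-Tversky-style value function: desirable outputs should
    # sit above the reference point, undesirable ones below it.
    desirable = lambda_d * (1 - torch.sigmoid(beta * (log_ratio - z_ref)))
    undesirable = lambda_u * (1 - torch.sigmoid(beta * (z_ref - log_ratio)))
    return torch.where(is_desirable, desirable, undesirable).mean()
```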
## Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E02 | DPO underperforms RLHF on out-of-distribution data (Apple, 2025) |
## Reasoning
The evidence is overwhelming that multiple alternatives exist and are in production use. The Apple finding on DPO's out-of-distribution limitations rules out the stronger claim that RLHF is obsolete, but it does not undermine the core finding that alternatives are viable and widely adopted.
## Relationship to Other Hypotheses
H1 is the affirmative hypothesis. H2 (the negative hypothesis) is effectively eliminated by the same evidence. H3 (the nuanced hypothesis) adds important context: the alternatives coexist with RLHF rather than fully displacing it.