H2 — RLHF Remains Dominant with No Viable Alternatives

Statement

Despite research interest, RLHF remains the dominant and preferred alignment method, with proposed alternatives being primarily academic exercises that have not achieved production viability.

Status

Eliminated. Multiple alternatives are demonstrably in production use. DPO and its variants are widely adopted in open-source models, RLAIF/CAI powers Anthropic's Claude, GRPO is the dominant RL optimizer for open reasoning models, and RLVR is used for verifiable-reward training.
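To make concrete why DPO qualifies as a production-viable substitute for RLHF, note that it replaces the reward-model-plus-RL pipeline with a single supervised loss over preference pairs. A minimal sketch of that per-example objective, assuming summed token log-probabilities for the chosen and rejected responses are already available (function and argument names here are illustrative, not from any specific library):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * implicit reward margin).

    The implicit reward of a response is its policy log-prob minus its
    frozen-reference log-prob; the loss pushes the chosen response's
    implicit reward above the rejected one's. No reward model or RL
    rollout is needed, which is the practical appeal over PPO-style RLHF.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree (margin 0), loss is log 2; as the policy
# separates chosen from rejected relative to the reference, loss falls.
```

The `beta` hyperparameter plays the role of the KL penalty strength in RLHF: higher values keep the policy closer to the reference model.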

Supporting Evidence

SRC02-E02: DPO underperforms RLHF on out-of-distribution data, suggesting RLHF retains some advantages.

Contradicting Evidence

SRC02-E01: DPO is widely adopted and matches RLHF performance.
SRC03-E01: Constitutional AI is deployed in production.
SRC04-E01: RLAIF matches RLHF at 100x lower cost.
SRC06-E01: GRPO is the dominant RL optimizer for open LLMs.
SRC07-E01: KTO matches preference-based methods.
SRC08-E01: An industry-wide transition away from classic RLHF is underway.

Reasoning

The sole item of supporting evidence (SRC02-E02, DPO's out-of-distribution limitations) does not establish that RLHF is without viable alternatives; it shows only that one alternative has a context-specific weakness. Set against six contradicting items spanning multiple independent methods and deployments, the overwhelming weight of evidence contradicts H2.

Relationship to Other Hypotheses

H2 is the negative hypothesis. Its elimination supports H1 and H3 as the competing explanatory frameworks.