H2 — RLHF Remains Dominant with No Viable Alternatives

Statement

Despite research interest, RLHF remains the dominant and preferred alignment method, with proposed alternatives being primarily academic exercises that have not achieved production viability.

Status

Eliminated. Multiple alternatives are demonstrably in production use. DPO and its variants are widely adopted in open-source models, RLAIF/CAI powers Anthropic's Claude, GRPO is the dominant RL optimizer for open reasoning models, and RLVR is used for verifiable-reward training.
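To make concrete why DPO qualifies as a production-viable substitute for RLHF, note that it replaces the reward-model-plus-RL pipeline with a single supervised loss over preference pairs. A minimal sketch of that per-example objective, assuming summed token log-probabilities for the chosen and rejected responses are already available (function and argument names here are illustrative, not from any specific library):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * implicit reward margin).

    The implicit reward of a response is its policy log-prob minus its
    frozen-reference log-prob; the loss pushes the chosen response's
    implicit reward above the rejected one's. No reward model or RL
    rollout is needed, which is the practical appeal over PPO-style RLHF.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree (margin 0), loss is log 2; as the policy
# separates chosen from rejected relative to the reference, loss falls.
```

The `beta` hyperparameter plays the role of the KL penalty strength in RLHF: higher values keep the policy closer to the reference model.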

Supporting Evidence

SRC02-E02: DPO underperforms RLHF on out-of-distribution data, suggesting RLHF retains some advantages.

Contradicting Evidence

SRC02-E01: DPO is widely adopted and matches RLHF performance.
SRC03-E01: Constitutional AI is deployed in production.
SRC04-E01: RLAIF matches RLHF at 100x lower cost.
SRC06-E01: GRPO is the dominant RL optimizer for open LLMs.
SRC07-E01: KTO matches preference-based methods.
SRC08-E01: An industry-wide transition away from classic RLHF is underway.

Reasoning

The sole item of supporting evidence (SRC02-E02, DPO's out-of-distribution limitations) does not establish that RLHF is without viable alternatives; it shows only that one alternative has a context-specific weakness. Set against six contradicting items spanning multiple independent methods and deployments, the overwhelming weight of evidence contradicts H2.

Relationship to Other Hypotheses

H2 is the negative hypothesis. Its elimination supports H1 and H3 as the competing explanatory frameworks.