## H2 — RLHF Remains Dominant with No Viable Alternatives
### Statement
Despite research interest, RLHF remains the dominant and preferred alignment method, with proposed alternatives being primarily academic exercises that have not achieved production viability.
### Status
Eliminated. Multiple alternatives are demonstrably in production use: DPO and its variants are widely adopted in open-source models, RLAIF/CAI powers Anthropic's Claude, GRPO is the dominant RL optimizer for open reasoning models, and RLVR is used to train on tasks with programmatically checkable rewards such as math and code.
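For context on why DPO in particular spread so quickly, it collapses RLHF's reward-model-plus-PPO pipeline into a single supervised objective over preference pairs $(x, y_w, y_l)$, following Rafailov et al. (2023):

$$
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
$$

Here $\pi_{\mathrm{ref}}$ is the frozen SFT policy, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference. Training requires no reward model and no online sampling loop, which is central to the production viability this section weighs.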
### Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E02 | DPO underperforms RLHF on out-of-distribution data, suggesting RLHF retains some advantages |
### Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO is widely adopted and matches RLHF performance |
| SRC03-E01 | Constitutional AI is deployed in production |
| SRC04-E01 | RLAIF matches RLHF at 100x lower cost |
| SRC06-E01 | GRPO is the dominant RL optimizer for open LLMs (see the sketch after this table) |
| SRC07-E01 | KTO matches preference-based methods while needing only unpaired binary feedback |
| SRC08-E01 | An industry-wide transition away from classic RLHF pipelines is underway |
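To make the GRPO entry (SRC06-E01) concrete: GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's own mean and standard deviation, dropping the learned value (critic) network PPO requires. A minimal sketch of that advantage step, assuming per-completion scalar rewards (the function name and example rewards are illustrative, not from any cited source):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: score each sampled completion's
    reward against its own group's mean and standard deviation, which
    replaces the learned value (critic) network PPO requires."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All completions in the group scored identically: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: four completions of one prompt, graded 0/1 by a verifier (RLVR-style).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]
```

This is also why GRPO pairs naturally with RLVR: when a verifier emits binary rewards per completion, group normalization converts them directly into a usable policy-gradient signal.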
### Reasoning
The sole piece of supporting evidence (DPO's out-of-distribution limitations, SRC02-E02) does not support the hypothesis that RLHF has no viable alternatives; it shows only that one alternative has context-specific limitations. The overwhelming weight of evidence contradicts H2.
### Relationship to Other Hypotheses
H2 is the negative hypothesis. Its elimination leaves H1 and H3 standing as the competing explanatory frameworks.