R0040/2026-03-28/Q001/H2
Statement
No viable alternatives to RLHF exist; RLHF remains the dominant and only practically viable alignment method in production use.
Status
Current: Eliminated
The evidence comprehensively contradicts H2. Multiple alternatives have not only been proposed but also adopted in production by major AI labs: Anthropic uses Constitutional AI/RLAIF for Claude, DeepSeek uses GRPO for its reasoning models, and DPO has been widely adopted across the research community and in production systems.
Supporting Evidence
No evidence supports H2. Every source examined documents at least one alternative to RLHF that has been empirically validated and/or deployed in production.
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E01 | DPO demonstrated equal or superior performance to RLHF-PPO at NeurIPS 2023 |
| SRC03-E01 | Anthropic has used Constitutional AI for Claude since 2022, scaling to a 23,000-word constitution by 2026 |
| SRC04-E01 | DeepSeek deployed GRPO in production for DeepSeek-R1 |
| SRC05-E01 | KTO published at ICML 2024 with performance matching DPO across 1B-30B scales |
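To make the first row concrete: DPO (SRC02-E01) dispenses with RLHF's separate reward model and PPO loop by optimizing a closed-form preference objective directly on chosen/rejected response pairs. A minimal sketch of the per-pair loss follows; the function name, variable names, and the default `beta` value are illustrative, not taken from any particular implementation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w, logp_l         : policy log-probs of the chosen (w) and
                             rejected (l) responses
    ref_logp_w, ref_logp_l : frozen reference-model log-probs of the
                             same two responses
    beta                   : strength of the implicit KL constraint
    """
    # Implicit reward margin: difference of policy-vs-reference
    # log-ratios between chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin (Bradley-Terry likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy agrees with the reference on both responses, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability mass toward the chosen response relative to the reference.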
Reasoning
H2 is definitively eliminated by the evidence. The question is not whether alternatives exist — they demonstrably do — but rather the nature and degree of their departure from RLHF (which is the domain of H3).
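As one illustration of degree of departure: GRPO (SRC04-E01) drops RLHF-PPO's learned value function entirely and instead normalizes each reward against the other responses sampled for the same prompt. A minimal sketch of that group-relative advantage computation, with the function name and the stabilizing epsilon chosen for illustration:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards):
    """Group-relative advantages for responses to a single prompt.

    group_rewards: scalar rewards for each sampled response in the group.
    Returns each reward standardized against the group mean and
    population standard deviation (no learned critic involved).
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # Epsilon guards against zero variance when all rewards tie.
    return [(r - mu) / (sigma + 1e-8) for r in group_rewards]
```

Responses scoring above their group mean receive positive advantages and are reinforced; below-mean responses are penalized, which is the sense in which GRPO modifies rather than abandons the policy-gradient core of RLHF.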
Relationship to Other Hypotheses
H2 represents the null hypothesis and is cleanly eliminated. The remaining question is the balance between H1 (genuinely distinct alternatives) and H3 (modifications within the same paradigm).