Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Hypothesis H2

Statement

No viable alternatives to RLHF exist; RLHF remains the only practically viable alignment method in production use.

Status

Current: Eliminated

The evidence comprehensively contradicts H2. Multiple alternatives have not only been proposed but adopted in production by major AI labs. Anthropic uses Constitutional AI/RLAIF for Claude, DeepSeek uses GRPO for its reasoning models, and DPO has been widely adopted across the research community and in production systems.
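Of these, GRPO illustrates the departure most compactly: it drops the learned value model (critic) that PPO-based RLHF trains alongside the policy, and instead baselines each reward against the other responses sampled for the same prompt. A minimal sketch of that group-relative advantage step, assuming PyTorch tensors (the function and argument names are illustrative, not taken from DeepSeek's code):

    import torch

    def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Rewards of the G responses sampled for a single prompt.
        # Normalizing within the group supplies the baseline that PPO
        # would otherwise obtain from a separately trained critic.
        mean = group_rewards.mean()
        std = group_rewards.std()
        return (group_rewards - mean) / (std + eps)

The policy is then updated with a PPO-style clipped objective using these advantages; the practical saving is that no critic network needs to be trained or stored.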

Supporting Evidence

No evidence supports H2. Every source examined documents at least one alternative to RLHF that has been empirically validated and/or deployed in production.

Contradicting Evidence

Evidence  | Summary
SRC02-E01 | DPO demonstrated equal or superior performance to RLHF-PPO at NeurIPS 2023 (see sketch below)
SRC03-E01 | Anthropic has used Constitutional AI for Claude since 2022, scaling to a 23,000-word constitution by 2026
SRC04-E01 | DeepSeek deployed GRPO in production for DeepSeek-R1
SRC05-E01 | KTO, published at ICML 2024, matched DPO performance across 1B-30B model scales
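To make concrete why DPO (SRC02-E01) counts as an alternative rather than a cosmetic variant: it optimizes the policy directly on preference pairs, with no reward-model training stage and no PPO rollouts. A minimal sketch of the DPO objective, assuming summed per-sequence log-probabilities are already computed (the function and variable names here are illustrative):

    import torch
    import torch.nn.functional as F

    def dpo_loss(pi_chosen_logp: torch.Tensor, pi_rejected_logp: torch.Tensor,
                 ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # Each argument holds summed token log-probs of the chosen or
        # rejected response under the trainable policy (pi_*) or the
        # frozen reference model (ref_*).
        # Implicit rewards: beta-scaled log-ratios of policy to reference.
        chosen = beta * (pi_chosen_logp - ref_chosen_logp)
        rejected = beta * (pi_rejected_logp - ref_rejected_logp)
        # Maximize the Bradley-Terry probability that chosen beats rejected.
        return -F.logsigmoid(chosen - rejected).mean()

A single supervised-style training loop over (prompt, chosen, rejected) triples replaces the reward-model-plus-PPO stages of the classic RLHF pipeline.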

Reasoning

H2 is definitively eliminated by the evidence. The question is not whether alternatives exist — they demonstrably do — but rather the nature and degree of their departure from RLHF (which is the domain of H3).

Relationship to Other Hypotheses

H2 represents the null hypothesis and is cleanly eliminated. The remaining question is the balance between H1 (genuinely distinct alternatives) and H3 (modifications within the same paradigm).