# R0040/2026-03-28/Q001/H1

## Statement
Multiple viable alternatives to RLHF exist and are in active use by the AI research community. These alternatives are theoretically grounded, empirically validated, and have been adopted in production systems by major AI labs.
## Status
Current: Supported
The evidence strongly supports H1. At least six distinct algorithmic alternatives to RLHF have been proposed, empirically evaluated, and adopted in production. DPO, RLAIF/Constitutional AI, GRPO, KTO, ORPO, and RLVR each represent substantively different approaches, and multiple major AI labs have publicly adopted one or more of these methods.
## Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | Overview of DPO, RLAIF, and GRPO as distinct post-training alternatives |
| SRC02-E01 | DPO matches or exceeds RLHF on summarization and dialogue tasks |
| SRC03-E01 | Constitutional AI adopted by Anthropic as primary alignment method for Claude |
| SRC04-E01 | GRPO adopted by DeepSeek for R1 reasoning model, halves compute vs PPO |
| SRC05-E01 | KTO matches DPO performance using only binary feedback signals |
| SRC07-E01 | ORPO eliminates reference model requirement entirely |
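The compute saving attributed to GRPO (SRC04-E01) comes from replacing PPO's learned critic with a baseline computed directly from a group of sampled responses. A minimal sketch of that group-relative advantage calculation, with illustrative function and variable names (the published method operates on per-response rewards exactly like this, though production implementations work on batched tensors):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO.

    Instead of training a separate value (critic) network as PPO does,
    the baseline is the mean reward of a group of responses sampled for
    the same prompt, and advantages are normalized by the group's
    standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]
```

Because the baseline is recomputed per prompt from the samples themselves, no second model needs to be trained or held in memory, which is where the roughly halved compute versus PPO comes from.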
## Contradicting Evidence
No evidence directly contradicts H1. However, most of the alternatives share conceptual lineage with RLHF (see H3), which partially qualifies how independent from RLHF they really are.
## Reasoning
The evidence is unambiguous: multiple alternatives exist, are theoretically motivated by distinct principles, and have been deployed in production. DPO (NeurIPS 2023) eliminates the reward model entirely. Constitutional AI/RLAIF (Anthropic, 2022) replaces human feedback with AI feedback guided by principles. GRPO (DeepSeek, 2024) eliminates the critic model. KTO (ICML 2024) uses prospect theory and binary signals instead of preference pairs. ORPO (2024) removes the reference model. RLVR (2025) uses verifiable correctness rather than preference signals. The breadth and depth of adoption across Anthropic, DeepSeek, Meta, and others confirm practical viability.
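To make concrete how DPO "eliminates the reward model entirely": it optimizes the policy directly on preference pairs via a closed-form logistic loss over log-probability ratios against a frozen reference model. A minimal sketch for a single pair, following the published loss (argument names are illustrative; real implementations operate on batched token-level log-probabilities):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (logp_*) or the frozen reference model
    (ref_logp_*). No separately trained reward model is involved.
    """
    # Implicit reward of each response: how far the policy has shifted
    # probability mass relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Bradley-Terry logistic loss on the margin difference; beta scales
    # how strongly the policy is pushed away from the reference.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid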
## Relationship to Other Hypotheses
H1 is the strongest hypothesis but does not fully exclude H3. While multiple alternatives exist and are in use, many share structural similarities with RLHF (preference-based optimization, policy gradient methods). The distinction between H1 and H3 hinges on whether "alternative" requires fundamental conceptual departure or merely algorithmic novelty. The evidence supports both readings simultaneously.