
Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Hypothesis H3

Statement

Alternatives exist but represent modifications rather than replacements of the RLHF paradigm. Most "alternatives" are variations on preference-based optimization rather than fundamentally different approaches, and the field is evolving RLHF rather than abandoning it.

Status

Current: Partially supported

There is meaningful truth to H3. Many alternatives (DPO, KTO, ORPO) can be derived mathematically from the same RLHF objective: they optimize the same quantity but remove intermediate steps. RLAIF changes the feedback source but retains the RL optimization loop. However, some methods, such as RLVR (reinforcement learning with verifiable rewards) and GRPO applied to reasoning tasks, represent more fundamental departures by replacing subjective preference signals with objective correctness criteria. The picture is nuanced: the field is both evolving RLHF and, in some domains, replacing it.
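The shared-objective claim can be made concrete. Below is a minimal sketch of the DPO loss for a single preference pair: it optimizes the same KL-regularized reward-maximization objective as RLHF, but in closed form over preferences rather than via an RL loop. The function signature and the toy log-probabilities are illustrative assumptions, not drawn from any specific implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The implicit reward of a response y is beta * log(pi(y|x) / pi_ref(y|x));
    the loss is -log sigmoid(reward_chosen - reward_rejected).
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy numbers: the policy already prefers the chosen response slightly,
# so the margin is positive and the loss falls below log(2).
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.5, ref_logp_rejected=-5.5, beta=0.1)
```

Note that no reward model or sampling appears anywhere: the preference data enters the loss directly, which is precisely the "same objective, fewer intermediate steps" point.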

Supporting Evidence

Evidence Summary
SRC02-E01 DPO is mathematically derived from the RLHF objective — it solves the same optimization problem in closed form
SRC05-E01 KTO authors show DPO and similar methods belong to a family of "human-aware losses" derived from the same principles
SRC06-E01 Constitutional AI replaces the feedback source but retains the RL optimization structure

Contradicting Evidence

Evidence Summary
SRC04-E01 GRPO eliminates the critic model entirely, a structural departure from PPO-based RLHF
SRC07-E01 ORPO eliminates both the reference model and the separate alignment phase
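To illustrate the structural departure SRC04-E01 describes: GRPO replaces the learned critic with a group-relative baseline, standardizing the rewards of several sampled responses to the same prompt. A minimal sketch (function and variable names are illustrative):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a group of
    responses to the same prompt, instead of querying a critic model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The baseline is computed from the group itself, so the value network that PPO-based RLHF trains alongside the policy is simply absent.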

Reasoning

H3 captures an important nuance. The DPO paper explicitly derives its objective from the RLHF reward-maximization problem: it is mathematically equivalent to RLHF under certain conditions, just solved in closed form rather than through an RL loop. The KTO authors frame DPO and related methods as belonging to a unified family of loss functions. This supports reading these methods as evolution rather than revolution. However, the emergence of RLVR (training against verifiable correctness signals) and the elimination of key architectural components (critic models, reference models, reward models) do represent structural innovation beyond mere parameter changes. The most accurate characterization is a spectrum: from close variants (DPO) to moderate departures (GRPO, Constitutional AI) to more fundamental shifts (RLVR for reasoning domains).
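The contrast with preference-based signals can be shown with a sketch of a verifiable reward in the RLVR style: instead of a learned reward model scoring which response humans prefer, the reward is an objective correctness check. The answer-extraction convention below is a simplified assumption for illustration, not any specific system's verifier.

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the
    known-correct answer, else 0.0. No preference data or reward
    model is involved."""
    # Assume (for this sketch) answers end with an "Answer: <value>" line.
    for line in reversed(model_output.strip().splitlines()):
        if line.startswith("Answer:"):
            predicted = line.removeprefix("Answer:").strip()
            return 1.0 if predicted == ground_truth.strip() else 0.0
    return 0.0  # no parseable answer: treated as incorrect

r = verifiable_reward("Let x = 6 * 7.\nAnswer: 42", "42")
```

Because the signal is a ground-truth check rather than a model of human judgment, this is the part of the spectrum where "replacing RLHF" is the more accurate description.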

Relationship to Other Hypotheses

H3 provides essential nuance to H1. Both can be simultaneously true: multiple alternatives exist AND most share conceptual DNA with RLHF. The supported answer is closest to H1 with H3 as an important qualifier — the alternatives are real and adopted, but the field is evolving a paradigm rather than abandoning it wholesale.