Research	R0040 — RLHF Alternatives
Run	2026-03-29
Query	Q001 — RLHF Alternatives
Hypothesis	H3

H3: RLHF Is Being Augmented and Specialized Rather Than Replaced

Statement

The field is not replacing RLHF wholesale. Instead, it is evolving toward a diverse toolkit in which different methods address different RLHF failure modes, and RLHF-derived techniques coexist with newer approaches in a complementary rather than competitive relationship.

Status

Partially supported. The evidence shows both replacement (GRPO/RLVR displacing PPO-based RLHF for reasoning tasks) and augmentation (CAI adding principle-based self-critique to RL pipelines). In practice, methods fall on a spectrum from full replacement to augmentation, depending on the use case.

Supporting Evidence

Evidence	Summary
SRC02-E02	DPO has out-of-distribution (OOD) limitations, suggesting that no single method fully replaces RLHF
SRC03-E01	CAI partly replaces and partly augments RLHF with principle-based feedback
SRC05-E01	Distinction between tractable and fundamental RLHF problems explains diverse solution landscape
SRC06-E01	GRPO works with both human and verifiable rewards, suggesting method flexibility
SRC07-E01	KTO specifically targets the data collection problem
SRC08-E01	The transition is from preference tuning to reward optimization, not away from RL entirely
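
The reward-source flexibility noted in SRC06-E01 can be illustrated with a minimal sketch. It assumes the commonly described GRPO advantage, in which each sampled completion's reward is normalized by the mean and standard deviation of its own group, and it uses the population standard deviation as a simplifying choice. The function names, the exact-match checker, and the length-based stand-in for a learned reward model are all hypothetical and do not come from the cited sources.

```python
from statistics import mean, pstdev
from typing import Callable, List

# A reward function maps (prompt, completion) to a scalar score.
RewardFn = Callable[[str, str], float]


def group_relative_advantages(
    prompt: str,
    completions: List[str],
    reward_fn: RewardFn,
    eps: float = 1e-8,
) -> List[float]:
    """Score a group of completions and normalize rewards within the group.

    The advantage computation never inspects where the reward came from,
    which is why the same loop can sit on top of human-derived or
    programmatically verifiable rewards.
    """
    rewards = [reward_fn(prompt, c) for c in completions]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def verifiable_reward(prompt: str, completion: str) -> float:
    """Hypothetical RLVR-style checker: 1.0 on exact match, else 0.0."""
    return 1.0 if completion.strip() == "4" else 0.0


def length_penalty_reward(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for a learned preference reward model.

    A real reward model would be a trained network; this toy version
    simply prefers shorter completions.
    """
    return -float(len(completion))


samples = ["4", "5", "4", "22"]
print(group_relative_advantages("What is 2 + 2?", samples, verifiable_reward))
# approximately [1.0, -1.0, 1.0, -1.0]
```

Swapping `verifiable_reward` for `length_penalty_reward` changes only the scoring call, not the advantage computation, which is the sense in which the method is flexible across reward sources.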

Contradicting Evidence

Evidence	Summary
SRC04-E01	RLAIF fully replaces human feedback in some contexts
SRC01-E01	Fundamental RLHF flaws suggest replacement, not mere augmentation, is needed

Reasoning

The evidence supports a nuanced picture that varies by domain. For reasoning tasks, GRPO with RLVR has largely replaced PPO-based RLHF. For general alignment, DPO variants compete with RLHF but have not eliminated it. For safety training, CAI mostly augments the RL pipeline: it substitutes principle-based feedback for part of the human feedback while keeping the RL stage. The "toolkit" framing is accurate, but it understates the degree to which some methods are genuinely replacing RLHF in specific domains.

Relationship to Other Hypotheses

H3 is the nuanced, conditional hypothesis. It is partially supported, whereas H1 is more strongly supported. Together they give the most complete picture: alternatives are real and in use (H1), and the landscape is a heterogeneous toolkit (H3).