H3 — RLHF Is Being Augmented and Specialized Rather Than Replaced
Statement
RLHF is not being replaced wholesale. Instead, the field is evolving toward a diverse toolkit in which different methods address different RLHF failure modes, and RLHF-derived techniques coexist with newer approaches in a complementary rather than competitive relationship.
Status
Partially supported. The evidence shows both replacement (GRPO/RLVR replacing PPO/RLHF for reasoning) and augmentation (CAI adding principle-based self-critique to RL pipelines). In practice, outcomes fall on a spectrum from full replacement to augmentation, depending on the use case.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E02 | DPO has out-of-distribution (OOD) limitations, suggesting no single method fully replaces RLHF |
| SRC03-E01 | CAI partly replaces and partly augments RLHF with principle-based feedback |
| SRC05-E01 | Distinction between tractable and fundamental RLHF problems explains diverse solution landscape |
| SRC06-E01 | GRPO works with both human and verifiable rewards, suggesting method flexibility |
| SRC07-E01 | KTO specifically targets the data-collection problem |
| SRC08-E01 | Transition is from preference tuning to reward optimization, not from RL entirely |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC04-E01 | RLAIF fully replaces human feedback in some contexts |
| SRC01-E01 | Fundamental RLHF flaws suggest replacement, not mere augmentation, is needed |
Reasoning
The evidence supports a nuanced picture that varies by use case. For reasoning tasks, GRPO+RLVR has largely replaced PPO+RLHF. For general alignment, DPO variants compete with RLHF but have not eliminated it. For safety training, CAI augments the RL pipeline with principle-based feedback rather than replacing it. The "toolkit" framing is therefore accurate, but it understates the degree to which some methods are genuinely replacing RLHF in specific domains.
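
One way to see why GRPO fits both the replacement and the augmentation story (SRC06-E01) is that its core step, normalizing rewards within a group of completions sampled for the same prompt, does not depend on where the rewards come from. The following is a minimal sketch under stated assumptions, not a full training loop: the function names (`group_relative_advantages`, `verifiable_reward`, `score_group`) and the exact-match check are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


def verifiable_reward(completion: str, reference: str) -> float:
    """Hypothetical RLVR-style check: 1.0 on exact match, otherwise 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def score_group(completions: List[str], reward_fn: Callable[[str], float]) -> List[float]:
    """Score a group with any reward source, then compute group-relative advantages."""
    rewards = [reward_fn(c) for c in completions]
    return group_relative_advantages(rewards)


# Assumed example: four sampled answers to one prompt whose reference answer is "4".
samples = ["4", "5", "4 ", "3"]
rlvr_advantages = score_group(samples, lambda c: verifiable_reward(c, "4"))

# Stand-in for a learned preference/reward model (here it simply favors short answers).
proxy_advantages = score_group(samples, lambda c: -float(len(c)))
```

Only the `reward_fn` argument changes between the two calls. In this sketch, moving from human-preference rewards to verifiable rewards swaps the reward source while leaving the optimization step untouched, which is consistent with the mixed "replacement and augmentation" picture described above.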
Relationship to Other Hypotheses
H3 is the nuanced/conditional hypothesis. It is partially supported alongside the more strongly supported H1. Together they paint the most complete picture: alternatives are real and in use (H1), and the landscape is a heterogeneous toolkit (H3).