# Q001 — RLHF Alternatives — Query Definition

## Query as Received
What alternatives to RLHF are being considered or in use by the AI research community?
## Query as Clarified
- Subject: Methods for aligning large language models that differ from or improve upon Reinforcement Learning from Human Feedback (RLHF)
- Scope: Both academic research proposals and production-deployed alternatives; covers changes to the optimization algorithm (DPO, GRPO), feedback source (RLAIF, RLVR), and training methodology (Constitutional AI, self-play)
- Evidence basis: Peer-reviewed papers, pre-prints from major AI labs, and industry deployment evidence
## Ambiguities Identified
- "Alternatives": Could mean complete replacements or incremental improvements. We interpret broadly to include both.
- "Being considered": Could mean purely theoretical or actively deployed. We cover both, distinguishing between them.
- "AI research community": Could mean academic only or include industry labs. We include both.
- "RLHF" boundary: Some methods (e.g., RLAIF) retain the RL framework but change the feedback source. Whether these qualify as "alternatives" depends on how narrowly one defines RLHF.
## Sub-Questions
- What methods change the optimization algorithm while keeping human preference data? (DPO, KTO, IPO, ORPO, SimPO)
- What methods change the feedback source away from human annotation? (RLAIF, Constitutional AI, RLVR)
- What methods change the RL optimizer itself? (GRPO, GSPO)
- What methods eliminate RL entirely? (DPO, KTO, self-play)
- Which alternatives are in production use vs. research-only?
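Several of the sub-questions above turn on DPO's central idea: instead of training a reward model and running an RL loop, DPO optimizes human preference pairs directly with a classification-style loss on policy-vs-reference log-probability ratios. A minimal sketch of the per-pair DPO loss in plain Python (the log-probability values below are toy numbers, not from any real model):

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are the summed log-probabilities of the chosen and
    rejected responses under the policy (pi) and a frozen
    reference model; beta controls the strength of the implicit
    KL constraint toward the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the reference model does
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): a logistic loss that shrinks as the
    # policy widens the margin in favor of the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy example: the policy already favors the chosen response
# relative to the reference, so the loss falls below
# -log(0.5) ≈ 0.693 (the value at zero margin)
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.5)
```

The key design point for the survey: the loss requires no reward model, no sampling during training, and no RL optimizer, which is why DPO appears both under "changes the optimization algorithm" and "eliminates RL entirely".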
## Hypotheses
| Hypothesis | Statement | Status |
|---|---|---|
| H1 | Multiple viable alternatives to RLHF exist and are in active use | Supported |
| H2 | RLHF remains dominant with no viable alternatives | Eliminated |
| H3 | RLHF is being augmented and specialized rather than replaced | Partially supported |