R0040/2026-03-28/Q001/SRC01/E01¶
Overview of three primary RLHF alternatives in active use.
URL: https://cbtw.tech/insights/rlhf-alternatives-post-training-optimization
Extract¶
The article identifies three primary alternatives to RLHF for post-training optimization:
- DPO (Direct Preference Optimization): Sidesteps the need for a reward model or reinforcement learning entirely, reframing preference learning as a classification problem. The model is optimized directly to prefer one output over another using a binary loss function derived from human preference pairs.
- RLAIF (Reinforcement Learning from AI Feedback): Trains the reward model on preferences generated by a pre-existing LLM rather than by humans. The cost advantage is dramatic: AI feedback costs less than $0.01 per data point, compared to $1+ for human feedback.
- GRPO (Group Relative Policy Optimization): Introduced by DeepSeek. Eliminates the critic model and instead estimates the advantage baseline from the scores of a group of completions sampled for the same prompt, significantly reducing training resources.
Recent industry adoption includes: Kimi K2 (Self-Critiqued Policy Optimization), Qwen 3 (Group Sequence Policy Optimization), and Claude (shifted to RLAIF).
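The DPO reframing described above can be sketched as a per-pair binary loss. This is a minimal illustration, not the article's code: it assumes sequence-level log-probabilities have already been computed for the chosen and rejected responses under both the policy and a frozen reference model, and `beta` is a hypothetical KL-control setting.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l).

    logp_w, logp_l         : policy log-probs of chosen / rejected response
    ref_logp_w, ref_logp_l : same quantities under the frozen reference model
    beta                   : strength of the implicit KL constraint (illustrative value)
    """
    # Implicit reward margin: beta * (policy log-ratio minus reference log-ratio)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: a binary classification loss
    # that pushes the policy to rank y_w above y_l.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss sits at log 2; as the policy ranks the chosen response higher than the reference does, the loss falls below that.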
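GRPO's critic-free baseline can be illustrated with a short sketch. Assumptions not in the source: scalar rewards for a group of completions sampled from the same prompt, and standardization by the group's own mean and standard deviation in place of a learned value function.

```python
def group_relative_advantages(rewards):
    """Compute GRPO-style advantages from a group of rewards.

    Instead of querying a critic model, each completion's advantage is its
    reward standardized against the mean and std of the group sampled for
    the same prompt.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        std = 1.0  # guard: a zero-variance group yields all-zero advantages
    return [(r - mean) / std for r in rewards]
```

Completions scoring above the group mean get positive advantages and are reinforced; those below are suppressed, with no critic network to train or store.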
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Documents three distinct alternatives with documented adoption by multiple labs |
| H2 | Contradicts | Multiple alternatives clearly exist and are in production use |
| H3 | Supports | All three methods still operate on preference data; RLAIF retains the RL loop |
Context¶
This is a secondary source synthesizing primary research. The cost comparison ($0.01 vs. $1+ per data point for AI vs. human feedback) is widely cited, but the specific figures should be verified against primary sources.