R0040/2026-03-28/Q001/SRC04/E01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Source SRC04
Evidence SRC04-E01
Type Factual

GRPO eliminates the critic model and roughly halves compute relative to PPO-based RLHF.

URL: https://arxiv.org/abs/2402.03300

Extract

Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm that replaces the value function in PPO with group-normalized rewards while retaining PPO-style token-level importance sampling. Instead of training a separate critic model to estimate a baseline, GRPO generates groups of responses and uses relative group scores as the baseline.
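The group-relative baseline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name is hypothetical, and it assumes the mean/std normalization over a group of sampled responses that the extract describes.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Compute critic-free advantages from one group of sampled responses.

    Each response's advantage is its reward normalized against the
    group's own mean and standard deviation, replacing the learned
    value-function baseline used in PPO.
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero
```

Because the baseline is computed from the group itself, responses scoring above the group mean get positive advantages and those below get negative ones, with no separate critic network needed.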

Key characteristics:

- Eliminates the critic model entirely, reducing training infrastructure
- Approximately halves the compute requirements compared to PPO-based RLHF
- First introduced in DeepSeekMath, subsequently adopted for DeepSeek-R1
- Works with both human preference rewards and verifiable rewards (math/code correctness)
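Since GRPO retains PPO-style token-level importance sampling, the surrogate objective looks like PPO's clipped loss with the group-normalized advantage substituted for the critic's estimate. The sketch below illustrates that combination under stated assumptions: the function name is hypothetical, the advantage is broadcast uniformly over a response's tokens, and GRPO's KL penalty against a reference policy is omitted for brevity.

```python
import numpy as np

def clipped_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss over tokens, critic-free.

    logp_new, logp_old: per-token log-probabilities under the current
    and sampling policies; advantages: per-token group-normalized
    advantages (same shape). Returns the loss to minimize.
    """
    ratio = np.exp(logp_new - logp_old)          # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the smaller surrogate, then negate for a loss.
    return -np.minimum(unclipped, clipped).mean()
```

When the policies coincide (ratio = 1), the loss reduces to the negative mean advantage, and the clip bounds how far a single update can push probability ratios, exactly as in PPO.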

GRPO gained widespread attention through DeepSeek-R1, one of the most prominent open reasoning models. Recent academic work (TIC-GRPO at NeurIPS 2025/ICLR 2026) has begun analyzing GRPO's convergence properties and on-policy vs off-policy training dynamics.

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Supports | GRPO is a distinct structural alternative adopted in production
H2 | Contradicts | GRPO demonstrates a viable non-PPO approach deployed at scale
H3 | N/A | GRPO changes the RL algorithm but still uses RL; whether this counts as "modification" vs "alternative" depends on definition

Context

GRPO's adoption for reasoning models (where verifiable rewards exist) represents a potentially more fundamental departure from RLHF than preference-based alternatives. When combined with RLVR (verifiable rewards), the entire human feedback loop is eliminated.