R0040/2026-03-28/Q001/SRC04/E01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Source SRC04
Evidence SRC04-E01
Type Factual

GRPO eliminates the critic model and roughly halves compute relative to PPO-based RLHF.

URL: https://arxiv.org/abs/2402.03300

Extract

Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm that replaces the value function in PPO with group-normalized rewards while retaining PPO-style token-level importance sampling. Instead of training a separate critic model to estimate a baseline, GRPO generates groups of responses and uses relative group scores as the baseline.
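The group-relative baseline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name is hypothetical, and it assumes the mean/std normalization over a group of sampled responses that the extract describes.

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Compute critic-free advantages from one group of sampled responses.

    Each response's advantage is its reward normalized against the
    group's own mean and standard deviation, replacing the learned
    value-function baseline used in PPO.
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero
```

Because the baseline is computed from the group itself, responses scoring above the group mean get positive advantages and those below get negative ones, with no separate critic network needed.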

Key characteristics:

- Eliminates the critic model entirely, reducing training infrastructure
- Approximately halves the compute requirements compared to PPO-based RLHF
- First introduced in DeepSeekMath, subsequently adopted for DeepSeek-R1
- Works with both human preference rewards and verifiable rewards (math/code correctness)
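Since GRPO retains PPO-style token-level importance sampling, the surrogate objective looks like PPO's clipped loss with the group-normalized advantage substituted for the critic's estimate. The sketch below illustrates that combination under stated assumptions: the function name is hypothetical, the advantage is broadcast uniformly over a response's tokens, and GRPO's KL penalty against a reference policy is omitted for brevity.

```python
import numpy as np

def clipped_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss over tokens, critic-free.

    logp_new, logp_old: per-token log-probabilities under the current
    and sampling policies; advantages: per-token group-normalized
    advantages (same shape). Returns the loss to minimize.
    """
    ratio = np.exp(logp_new - logp_old)          # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the smaller surrogate, then negate for a loss.
    return -np.minimum(unclipped, clipped).mean()
```

When the policies coincide (ratio = 1), the loss reduces to the negative mean advantage, and the clip bounds how far a single update can push probability ratios, exactly as in PPO.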

GRPO gained widespread attention through DeepSeek-R1, one of the most prominent open reasoning models. Recent academic work (TIC-GRPO at NeurIPS 2025/ICLR 2026) has begun analyzing GRPO's convergence properties and on-policy vs off-policy training dynamics.

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Supports | GRPO is a distinct structural alternative adopted in production
H2 | Contradicts | GRPO demonstrates a viable non-PPO approach deployed at scale
H3 | N/A | GRPO changes the RL algorithm but still uses RL; whether this counts as "modification" vs "alternative" depends on definition

Context

GRPO's adoption for reasoning models (where verifiable rewards exist) represents a potentially more fundamental departure from RLHF than preference-based alternatives. When combined with RLVR (verifiable rewards), the entire human feedback loop is eliminated.