R0040/2026-03-28/Q001/SRC04/E01¶
GRPO eliminates the critic model and halves compute vs PPO-based RLHF.
URL: https://arxiv.org/abs/2402.03300
Extract¶
Group Relative Policy Optimization (GRPO) is a critic-free reinforcement learning algorithm that replaces the value function in PPO with group-normalized rewards while retaining PPO-style token-level importance sampling. Instead of training a separate critic model to estimate a baseline, GRPO generates groups of responses and uses relative group scores as the baseline.
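Concretely, the group-relative baseline can be written as follows (a sketch of the standard formulation from the DeepSeekMath paper: for one prompt, sample a group of $G$ responses with rewards $r_1,\dots,r_G$, then normalize each reward by the group statistics):

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_1,\dots,r_G\}\right)}{\operatorname{std}\!\left(\{r_1,\dots,r_G\}\right)}
```

This normalized score $\hat{A}_i$ replaces the critic's value estimate as the advantage in the PPO-style clipped objective.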
Key characteristics:

- Eliminates the critic model entirely, reducing training infrastructure
- Approximately halves the compute requirements compared to PPO-based RLHF
- First introduced in DeepSeekMath, subsequently adopted for DeepSeek-R1
- Works with both human preference rewards and verifiable rewards (math/code correctness)
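The critic-free baseline described above can be sketched in a few lines. This is an illustrative toy, not a production implementation; the epsilon in the denominator is a common stabilizer I have added for groups with identical rewards, not something the extract specifies.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: score each sampled response against
    the mean and standard deviation of its own group, so no separate
    critic model is needed to estimate a baseline."""
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    # eps guards against division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of G=4 sampled responses, each scored by a reward
# model or a verifiable checker (e.g. math-answer correctness: 1 or 0).
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))
```

Correct responses receive positive advantages and incorrect ones negative, purely from within-group comparison; the advantages sum to zero by construction.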
GRPO gained widespread attention through DeepSeek-R1, one of the most prominent open reasoning models. Recent academic work (e.g. TIC-GRPO, NeurIPS 2025/ICLR 2026) has begun analyzing GRPO's convergence properties and its on-policy vs. off-policy training dynamics.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | GRPO is a distinct structural alternative adopted in production |
| H2 | Contradicts | GRPO demonstrates a viable non-PPO approach deployed at scale |
| H3 | N/A | GRPO changes the RL algorithm but still uses RL; whether this counts as "modification" vs "alternative" depends on definition |
Context¶
GRPO's adoption for reasoning models (domains where verifiable rewards exist) represents a potentially more fundamental departure from RLHF than preference-based alternatives: when combined with RLVR (reinforcement learning from verifiable rewards), the human feedback loop is eliminated entirely.