R0040/2026-04-01/Q001/SRC03/E01¶
GRPO is a critic-free RL alternative using group-relative reward normalization
URL: https://arxiv.org/abs/2402.03300
Extract¶
GRPO is a variant of PPO that estimates advantages through group-wise reward normalization while retaining PPO-style importance sampling. It foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.
Key results:
- GSM8K: 82.9% → 88.2%
- MATH: 46.8% → 51.7%
- Both in-domain and out-of-domain gains observed
- Became the standard RL optimizer for training reasoning models (DeepSeek-R1)
- Multiple variants have emerged (DAPO, DR-GRPO, GTPO, G2RPO-A), indicating an active research community
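A minimal sketch of the group-relative advantage that replaces the critic baseline: rewards for a group of completions sampled from the same prompt are standardized against the group mean and std. Function name, tensor shapes, and the toy binary rewards are illustrative assumptions, not taken from the paper's code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Baseline = mean reward of the group of completions sampled for the
    # same prompt; dividing by the group std normalizes the advantage scale.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions for one prompt, scored by a binary verifiable reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct completions get positive advantage, incorrect negative
```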
Relevance to Hypotheses¶
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Strength |
|---|---|---|
| Critic-free RL | Supports | Primary evidence -- eliminates critic/value network |
| Reasoning models | Supports | Standard optimizer for DeepSeek-R1 and similar models |
| Cost reduction | Supports | Significant memory and compute reduction vs PPO |
Context¶
GRPO is technically a modification of the RLHF pipeline (it still uses RL optimization) rather than a complete replacement. The key innovation is eliminating the critic network, which reduces memory requirements. It is typically used with RLVR (verifiable rewards) rather than human feedback for reasoning tasks.
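To illustrate the RLVR setup mentioned above, a verifiable reward can be as simple as an exact-match check of a model's final answer against a reference. This toy checker is a hypothetical sketch under that assumption; real pipelines use more robust answer extraction and symbolic or test-based verification.

```python
def verifiable_reward(completion: str, reference: str) -> float:
    # Toy exact-match check on the last token of the completion; real RLVR
    # setups parse out the answer and verify equivalence more carefully.
    predicted = completion.strip().split()[-1] if completion.strip() else ""
    return 1.0 if predicted == reference.strip() else 0.0

print(verifiable_reward("... so the answer is 42", "42"))  # 1.0
print(verifiable_reward("... so the answer is 41", "42"))  # 0.0
```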