SRC06-E01 — GRPO Eliminates the Critic Model¶
Extract¶
GRPO is "a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO." Its key innovation is "completely eliminating the critic model, sampling multiple reasoning traces, and using their average reward as a proxy for the critic." This "cuts in half the compute requirements to do Reinforcement Learning from Human Feedback (RLHF) compared to what was used for ChatGPT (PPO)." DeepSeekMath 7B achieved 51.7% on the MATH benchmark.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — GRPO is now the most common RL optimizer for open LLMs | Strong |
| H2 | Contradicts — GRPO is widely adopted in production | Strong |
| H3 | Supports — GRPO optimizes the RL algorithm but can use either human or verifiable rewards | Moderate |
Context¶
GRPO gained widespread attention after DeepSeek-R1 and has since been adopted or adapted by multiple organizations, appearing in the training recipes of open models such as Kimi K2 and Qwen 3.
Notes¶
GRPO is typically used with RLVR (verifiable rewards) rather than human preference data, making it part of a broader shift away from human feedback.
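To make the distinction concrete, here is an illustrative (not from the source) verifiable reward for a math task: a programmatic check against a reference answer replaces a human preference label or a learned reward model. Real RLVR pipelines use more robust answer extraction and equivalence checking than this exact-match sketch.

```python
def verifiable_math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward from a programmatic check rather than a human label:
    1.0 if the model's final answer matches the reference after trivial
    normalization, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

print(verifiable_math_reward("  3/4 ", "3/4"))  # 1.0
```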