Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q001 — RLHF Alternatives
Source SRC06
Evidence SRC06-E01

SRC06-E01 — GRPO Eliminates the Critic Model

Extract

GRPO is "a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO." Its key innovation is "completely eliminating the critic model, sampling multiple reasoning traces, and using their average reward as a proxy for the critic." This "cuts in half the compute requirements to do Reinforcement Learning from Human Feedback (RLHF) compared to what was used for ChatGPT (PPO)." DeepSeekMath 7B achieved 51.7% on the MATH benchmark.

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Strongly supports — GRPO is now the most common RL optimizer for open LLMs | Strong
H2 | Contradicts — GRPO is widely adopted in production | Strong
H3 | Supports — GRPO optimizes the RL algorithm but can use either human or verifiable rewards | Moderate

Context

GRPO gained widespread attention after DeepSeek-R1 and has been adopted or adapted by multiple organizations, including in models such as Kimi K2 and Qwen 3.

Notes

GRPO is typically used with RLVR (Reinforcement Learning with Verifiable Rewards) rather than human preference data, making it part of a broader shift away from human feedback.
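As an illustration of the verifiable-reward setup mentioned above (a hypothetical sketch, not from the source), a reward function for math problems can check an extracted final answer against a known ground truth instead of querying a learned preference model:

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Binary reward from an automatic check, not human preference data.

    Assumes the completion states its final answer after the marker
    'Answer:'; this convention is illustrative only.
    """
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

A reward like this plugs directly into the group-scoring step sketched under the Extract, which is how GRPO and RLVR are commonly combined in practice.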