SRC06-E01 — GRPO Eliminates the Critic Model¶
Extract¶
GRPO is "a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO." Its key innovation is "completely eliminating the critic model, sampling multiple reasoning traces, and using their average reward as a proxy for the critic." This "cuts in half the compute requirements to do Reinforcement Learning from Human Feedback (RLHF) compared to what was used for ChatGPT (PPO)." DeepSeekMath 7B achieved 51.7% on the MATH benchmark.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — GRPO is now the most common RL optimizer for open LLMs | Strong |
| H2 | Contradicts — GRPO is widely adopted in production | Strong |
| H3 | Supports — GRPO optimizes the RL algorithm but can use either human or verifiable rewards | Moderate |
Context¶
GRPO gained widespread attention after DeepSeek-R1 and has since been adopted or adapted by multiple organizations, appearing in the training recipes of open models such as Kimi K2 and Qwen 3.
Notes¶
GRPO is typically used with RLVR (verifiable rewards) rather than human preference data, making it part of a broader shift away from human feedback.
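To make the distinction concrete, here is an illustrative (not from the source) verifiable reward for a math task: a programmatic check against a reference answer replaces a human preference label or a learned reward model. Real RLVR pipelines use more robust answer extraction and equivalence checking than this exact-match sketch.

```python
def verifiable_math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward from a programmatic check rather than a human label:
    1.0 if the model's final answer matches the reference after trivial
    normalization, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

print(verifiable_math_reward("  3/4 ", "3/4"))  # 1.0
```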