R0040/2026-04-01/Q001/SRC03/E01¶
GRPO is a critic-free RL alternative using group-relative reward normalization
URL: https://arxiv.org/abs/2402.03300
Extract¶
GRPO is a variant of PPO that estimates advantages through group-wise reward normalization while retaining PPO-style importance sampling. It foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.
Key results:
- GSM8K: 82.9% → 88.2%
- MATH: 46.8% → 51.7%
- Both in-domain and out-of-domain gains observed
- Became the standard RL optimizer for training reasoning models (DeepSeek-R1)
- Multiple variants have emerged (DAPO, DR-GRPO, GTPO, G2RPO-A), indicating an active research community
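A minimal sketch of the group-relative advantage that replaces the critic baseline: rewards for a group of completions sampled from the same prompt are standardized against the group mean and std. Function name, tensor shapes, and the toy binary rewards are illustrative assumptions, not taken from the paper's code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Baseline = mean reward of the group of completions sampled for the
    # same prompt; dividing by the group std normalizes the advantage scale.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions for one prompt, scored by a binary verifiable reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct completions get positive advantage, incorrect negative
```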
Relevance to Hypotheses¶
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Strength |
|---|---|---|
| Critic-free RL | Supports | Primary evidence -- eliminates critic/value network |
| Reasoning models | Supports | Standard optimizer for DeepSeek-R1 and similar models |
| Cost reduction | Supports | Significant memory and compute reduction vs PPO |
Context¶
GRPO is technically a modification of the RLHF pipeline (it still uses RL optimization) rather than a complete replacement. The key innovation is eliminating the critic network, which reduces memory requirements. It is typically used with RLVR (verifiable rewards) rather than human feedback for reasoning tasks.
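To illustrate the RLVR setup mentioned above, a verifiable reward can be as simple as an exact-match check of a model's final answer against a reference. This toy checker is a hypothetical sketch under that assumption; real pipelines use more robust answer extraction and symbolic or test-based verification.

```python
def verifiable_reward(completion: str, reference: str) -> float:
    # Toy exact-match check on the last token of the completion; real RLVR
    # setups parse out the answer and verify equivalence more carefully.
    predicted = completion.strip().split()[-1] if completion.strip() else ""
    return 1.0 if predicted == reference.strip() else 0.0

print(verifiable_reward("... so the answer is 42", "42"))  # 1.0
print(verifiable_reward("... so the answer is 41", "42"))  # 0.0
```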