
Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Source SRC03
Evidence SRC03-E01
Type Factual

GRPO is a critic-free RL alternative using group-relative reward normalization

URL: https://arxiv.org/abs/2402.03300

Extract

GRPO is a variant of PPO that estimates advantages through group-wise reward normalization while retaining PPO-style importance sampling. It forgoes the critic model, instead estimating the baseline from group scores, which significantly reduces training resources.
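A minimal sketch of the two ideas in that description: the group-relative advantage estimate and the retained PPO-style clipped objective. This assumes a PyTorch setup; the function names, tensor shapes, and clipping constant are illustrative rather than taken from the paper's code, and the paper's KL penalty against a reference policy is omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: each completion's reward is normalized
    against the mean/std of its own group, replacing PPO's learned critic.

    rewards: (num_prompts, group_size) -- scores for group_size sampled
    completions per prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style importance-sampled objective, which GRPO retains."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because the baseline comes from statistics over the sampled group rather than a second trained network, the critic's parameters, optimizer state, and forward/backward passes all disappear from the training loop.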

Key results:
- GSM8K: accuracy improved from 82.9% to 88.2%
- MATH: accuracy improved from 46.8% to 51.7%
- Gains observed both in-domain and out-of-domain
- Became the standard RL optimizer for training reasoning models (DeepSeek-R1)
- Multiple variants have emerged (DAPO, DR-GRPO, GTPO, G2RPO-A), indicating an active research community

Relevance to Hypotheses

Open-ended query -- maps to thematic clusters:

| Cluster | Relationship | Strength |
|---|---|---|
| Critic-free RL | Supports | Primary evidence -- eliminates the critic/value network |
| Reasoning models | Supports | Standard optimizer for DeepSeek-R1 and similar models |
| Cost reduction | Supports | Significant memory and compute reduction vs. PPO |

Context

GRPO is technically a modification of the RLHF pipeline (it still uses RL optimization) rather than a complete replacement. The key innovation is eliminating the critic network, which reduces memory requirements. For reasoning tasks it is typically paired with RLVR (reinforcement learning with verifiable rewards) rather than human preference feedback, as in the sketch below.
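To make the RLVR pairing concrete, a hedged sketch of a verifiable reward: a rule-based check against a reference answer in place of a learned preference model. The "####" answer delimiter is an assumed GSM8K-style convention, not something specified by the source.

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based binary reward: 1.0 if the extracted final answer matches
    the reference, else 0.0 -- no human preference model involved."""
    # Assumption: the model emits its final answer after "####"
    # (GSM8K-style); real pipelines parse format-specific outputs.
    final = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if final == gold_answer.strip() else 0.0
```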