S03 — GRPO and RLVR

Summary

| Field | Value |
| --- | --- |
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "GRPO group relative policy optimization DeepSeek alternative RLHF"; "RLVR reinforcement learning verifiable rewards reasoning models 2025" |
| Filters | None |
| Results returned | 20 (10 per query) |
| Results selected | 4 |
| Results rejected | 16 |

Selected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R01 | DeepSeekMath (arXiv) | https://arxiv.org/abs/2402.03300 | Primary paper introducing GRPO |
| S03-R02 | Group Relative Policy Optimization (Substack) | https://cameronrwolfe.substack.com/p/grpo | Detailed technical analysis |
| S03-R03 | RLVR Explained (Promptfoo) | https://www.promptfoo.dev/blog/rlvr-explained/ | Critical analysis of RLVR claims |
| S03-R04 | RLVR Implicitly Incentivizes Correct Reasoning (arXiv) | https://arxiv.org/abs/2506.14245 | Primary research on RLVR mechanisms |

Rejected Results

| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R05 | GTPO: Stabilizing GRPO | https://arxiv.org/html/2508.03772v4 | Extension work, not core alternative |
| S03-R06 | Training-Free GRPO (OpenReview) | https://openreview.net/forum?id=tyUnYbE7Gi | Variant, not the core method |
| S03-R07 | Demystifying GRPO (arXiv) | https://arxiv.org/html/2603.01162 | Theoretical analysis, not adoption data |
| S03-R08 | GRPO in RL Explained (DigitalOcean) | https://www.digitalocean.com/community/conceptual-articles/group-relative-policy-optimization-reinforcement-learning | Tutorial, covered by primary paper |
| S03-R09 | Why GRPO is Important (Oxen.ai) | https://ghost.oxen.ai/why-grpo-is-important-and-how-it-works/ | Blog post, covered by primary paper |
| S03-R10 | DeepSeekMath PDF | https://arxiv.org/pdf/2402.03300 | Duplicate format |
| S03-R11 | GRPO (DataCamp) | https://www.datacamp.com/blog/what-is-grpo-group-relative-policy-optimization | Tutorial content |
| S03-R12 | GRPO++ Tricks | https://cameronrwolfe.substack.com/p/grpo-tricks | Extension tricks, not core comparison |
| S03-R13-16 | Various RLVR papers | Various | Domain-specific applications or conference posters |

Notes

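Results from the two searches are combined. GRPO and RLVR cover different parts of the shift away from RLHF: GRPO changes the optimizer, replacing PPO's learned value baseline with advantages computed relative to a group of sampled completions, while RLVR changes the reward source, replacing a learned reward model with programmatically verifiable outcomes.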
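
To make the optimizer / reward-source distinction concrete, here is a minimal Python sketch. It is an illustration only, not code from any of the selected sources: the function names, the exact-match check, and the sample completions are all hypothetical. It assumes a binary exact-match reward as the RLVR side and mean/std normalization within one group of sampled completions as the GRPO side (population standard deviation is an assumption; implementations may differ).

```python
from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """RLVR-style reward (hypothetical): 1.0 on exact answer match, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: score each sample against its own group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no learned value function is needed.
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled completions (made-up data).
completions = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch the correct completions receive positive advantages and the incorrect ones negative advantages. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.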