S03 — GRPO and RLVR
Summary
| Field | Value |
|---|---|
| Source / Database | Web (Google via WebSearch) + arXiv |
| Query terms | "GRPO group relative policy optimization DeepSeek alternative RLHF"; "RLVR reinforcement learning verifiable rewards reasoning models 2025" |
| Filters | None |
| Results returned | 20 (10 per query) |
| Results selected | 4 |
| Results rejected | 16 |
Selected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S03-R01 | DeepSeekMath (arXiv) | https://arxiv.org/abs/2402.03300 | Primary paper introducing GRPO |
| S03-R02 | Group Relative Policy Optimization (Substack) | https://cameronrwolfe.substack.com/p/grpo | Detailed technical analysis |
| S03-R03 | RLVR Explained (Promptfoo) | https://www.promptfoo.dev/blog/rlvr-explained/ | Critical analysis of RLVR claims |
| S03-R04 | RLVR Implicitly Incentivizes Correct Reasoning (arXiv) | https://arxiv.org/abs/2506.14245 | Primary research on RLVR mechanisms |
Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S03-R05 | GTPO: Stabilizing GRPO | https://arxiv.org/html/2508.03772v4 | Extension work, not core alternative |
| S03-R06 | Training-Free GRPO (OpenReview) | https://openreview.net/forum?id=tyUnYbE7Gi | Variant, not the core method |
| S03-R07 | Demystifying GRPO (arXiv) | https://arxiv.org/html/2603.01162 | Theoretical analysis, not adoption data |
| S03-R08 | GRPO in RL Explained (DigitalOcean) | https://www.digitalocean.com/community/conceptual-articles/group-relative-policy-optimization-reinforcement-learning | Tutorial, covered by primary paper |
| S03-R09 | Why GRPO is Important (Oxen.ai) | https://ghost.oxen.ai/why-grpo-is-important-and-how-it-works/ | Blog post, covered by primary paper |
| S03-R10 | DeepSeekMath PDF | https://arxiv.org/pdf/2402.03300 | Duplicate format |
| S03-R11 | GRPO (DataCamp) | https://www.datacamp.com/blog/what-is-grpo-group-relative-policy-optimization | Tutorial content |
| S03-R12 | GRPO++ Tricks | https://cameronrwolfe.substack.com/p/grpo-tricks | Extension tricks, not core comparison |
| S03-R13 to S03-R16 | Various RLVR papers | Various | Domain-specific applications or conference posters |
Notes
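Results from two separate searches are combined here. GRPO and RLVR address different aspects of the shift away from classic RLHF: GRPO changes the optimizer (advantages are computed relative to a group of sampled completions rather than from a learned value model), while RLVR changes the reward source (a programmatic correctness check rather than a learned reward model).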
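
To make the optimizer versus reward-source distinction in the note above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from any of the selected sources: the function names, the exact-match check, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """RLVR-style reward (assumed form): 1.0 on an exact answer match, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against its own group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled completions (made-up data).
completions = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

The reward function stands in for the RLVR side (no learned reward model), and the group normalization stands in for the GRPO side (no learned value model). If every completion in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.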