R0040/2026-04-01/Q001/S03
WebSearch — GRPO and RLVR methods for reasoning models
Summary
| Field | Value |
|---|---|
| Source/Database | WebSearch |
| Query terms | GRPO group relative policy optimization DeepSeek alternative RLHF 2025; RLVR reinforcement learning verifiable rewards reasoning models 2025 2026 |
| Filters | None |
| Results returned | 20 (two searches combined) |
| Results selected | 3 |
| Results rejected | 17 |
Selected Results
Rejected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S03-R04 | Training-Free GRPO | https://openreview.net/forum?id=tyUnYbE7Gi | Variant, not core GRPO |
| S03-R05 | G2RPO-A: Guided Group Relative Policy Optimization | https://arxiv.org/html/2508.13023v1 | Variant paper |
| S03-R06 | TIC-GRPO | https://arxiv.org/pdf/2508.02833 | Implementation variant |
| S03-R07 | Revisiting GRPO | https://arxiv.org/html/2505.22257v1 | Analysis paper, not primary |
| S03-R08 | GTPO: Stabilizing GRPO | https://arxiv.org/html/2508.03772 | Variant paper |
| S03-R09 | RLVR Implicitly Incentivizes Correct Reasoning | https://arxiv.org/abs/2506.14245 | Theoretical paper, less practical |
| S03-R10 | Knowledge-to-Verification: RLVR in Knowledge-Intensive Domains | https://openreview.net/forum?id=EVS7SeKBqI | Domain extension, not core |
| S03-R11 | RLVR-World | https://openreview.net/forum?id=jpiSagi8aV | World model application, not core |
| S03-R12 | RLVR: The Training Breakthrough (Medium) | https://medium.com/@raktims2210/rlvr-the-training-breakthrough-that-will-make-reasoning-ai-verifiable-cf4209e79669 | Popular article, less rigorous |
| S03-R13 | Does RL Really Incentivize Reasoning Capacity in LLMs? | https://openreview.net/forum?id=4OsgYD7em5 | Skeptical analysis, interesting but secondary |
| S03-R14 | Reasoning Gym (NeurIPS 2025) | https://neurips.cc/virtual/2025/poster/121745 | Benchmark tool, not a method |
| S03-R15 | Bridging Perception and Reasoning: Token Reweighting | https://arxiv.org/html/2603.25077 | Multimodal extension, not core |
| S03-R16 | RLVR emergentmind topic | https://www.emergentmind.com/topics/reinforcement-learning-with-verified-rewards-rlvr | Aggregator, not primary |
| S03-R17 | DeepSeekMath PDF | https://arxiv.org/pdf/2402.03300 | Same paper as R01, PDF format |
Notes
The GRPO and RLVR searches are combined here because the two topics are closely related: GRPO is the optimizer most commonly used with RLVR. The variant papers (DAPO, DR-GRPO, GTPO, G2RPO-A) indicate an active research area building on GRPO.
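For orientation, the GRPO/RLVR pairing noted above can be sketched in a few lines. This is a minimal illustration, not taken from any of the listed sources: the exact-match verifier `verifiable_reward`, the sample answers, and the `eps` constant are all hypothetical, and the use of population standard deviation is an assumption (implementations differ on this detail). It shows only the group-relative advantage step, in which each sampled answer's reward is normalized against the other answers to the same prompt.

```python
import statistics


def verifiable_reward(answer: str, reference: str) -> float:
    """Hypothetical RLVR-style verifier: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation; eps avoids division
    by zero when every sample in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical sampled answers to one prompt whose reference answer is "42".
samples = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, without a learned value model; if every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.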