S03 — GRPO and RLVR

Summary


Source / Database	Web (Google via WebSearch) + arXiv
Query terms	"GRPO group relative policy optimization DeepSeek alternative RLHF"; "RLVR reinforcement learning verifiable rewards reasoning models 2025"
Filters	None
Results returned	20 (10 per query)
Results selected	4
Results rejected	16

Result	Title	URL	Selection rationale
S03-R01	DeepSeekMath (arXiv)	https://arxiv.org/abs/2402.03300	Primary paper introducing GRPO
S03-R02	Group Relative Policy Optimization (Substack)	https://cameronrwolfe.substack.com/p/grpo	Detailed technical analysis
S03-R03	RLVR Explained (Promptfoo)	https://www.promptfoo.dev/blog/rlvr-explained/	Critical analysis of RLVR claims
S03-R04	RLVR Implicitly Incentivizes Correct Reasoning (arXiv)	https://arxiv.org/abs/2506.14245	Primary research on RLVR mechanisms

Result	Title	URL	Rejection rationale
S03-R05	GTPO: Stabilizing GRPO	https://arxiv.org/html/2508.03772v4	Extension work, not core alternative
S03-R06	Training-Free GRPO (OpenReview)	https://openreview.net/forum?id=tyUnYbE7Gi	Variant, not the core method
S03-R07	Demystifying GRPO (arXiv)	https://arxiv.org/html/2603.01162	Theoretical analysis, not adoption data
S03-R08	GRPO in RL Explained (DigitalOcean)	https://www.digitalocean.com/community/conceptual-articles/group-relative-policy-optimization-reinforcement-learning	Tutorial, covered by primary paper
S03-R09	Why GRPO is Important (Oxen.ai)	https://ghost.oxen.ai/why-grpo-is-important-and-how-it-works/	Blog post, covered by primary paper
S03-R10	DeepSeekMath PDF	https://arxiv.org/pdf/2402.03300	Duplicate format
S03-R11	GRPO (DataCamp)	https://www.datacamp.com/blog/what-is-grpo-group-relative-policy-optimization	Tutorial content
S03-R12	GRPO++ Tricks	https://cameronrwolfe.substack.com/p/grpo-tricks	Extension tricks, not core comparison
S03-R13 to S03-R16	Various RLVR papers	Various	Domain-specific applications or conference posters

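Two separate searches were combined for this entry. GRPO and RLVR represent different aspects of the shift: GRPO changes the optimizer (a baseline computed from a group of sampled responses replaces PPO's learned value model), while RLVR changes the reward source (a programmatic correctness check replaces a learned reward model).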
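
To make the optimizer / reward-source distinction concrete, here is a minimal Python sketch. It is illustrative only and is not taken from any of the sources listed above. The function names (`verifiable_reward`, `group_relative_advantages`), the exact-match check on the final line, and the use of the population standard deviation are all assumptions made for this example; real implementations differ in how they extract answers and normalize rewards.

```python
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """RLVR-style reward (hypothetical checker): 1.0 if the last line of the
    completion exactly matches the reference answer, otherwise 0.0."""
    lines = completion.strip().splitlines()
    final_line = lines[-1].strip() if lines else ""
    return 1.0 if final_line == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage (sketch): score each sampled response against the
    mean and standard deviation of its own group, so no value model is needed.
    Population std and the eps term are assumptions of this example."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, a group of four sampled completions (made-up data).
group = ["2 + 2 = 4\n4", "I think it is 5\n5", "4", "The answer:\n3"]
rewards = [verifiable_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the two ideas are independent: the reward function could be swapped for a learned reward model without touching the advantage computation, and the advantage computation could be swapped for a critic-based estimate without touching the reward. If every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal.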