S02

WebSearch — RLVR DeepSeek R1 GRPO training methodology

Summary

Field	Value
Source/Database	WebSearch
Query terms	RLVR DeepSeek R1 GRPO training methodology sycophancy alignment
Filters	None
Results returned	10
Results selected	2
Results rejected	8

Result (selected)	Title	URL	Rationale for selection
S02-R01	DeepSeek-R1: Incentivizing Reasoning Capability in LLMs	https://arxiv.org/pdf/2501.12948	Primary paper on DeepSeek R1's RLVR implementation
S02-R02	Group Relative Policy Optimization (GRPO)	https://cameronrwolfe.substack.com/p/grpo	Technical deep-dive into GRPO algorithm

Result (rejected)	Title	URL	Rationale for rejection
S02-R03	RL Guide (Unsloth)	https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide	Implementation guide, not analysis
S02-R04	RLHF survey (arxiv)	https://arxiv.org/html/2504.12501v3	Broad RLHF survey, not RLVR-specific
S02-R05	Beyond Supervised Fine Tuning (Fireworks)	https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward	Vendor blog, covered by R01
S02-R06	State of RL for LLM Reasoning	https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training	Good overview but covered by R01
S02-R07	Technical Tour of DeepSeek Models	https://magazine.sebastianraschka.com/p/technical-deepseek	Broader than RLVR
S02-R08	PPO and GRPO guide	https://yugeten.github.io/posts/2025/01/ppogrpo/	Technical tutorial
S02-R09	GRPO++ Tricks	https://cameronrwolfe.substack.com/p/grpo-tricks	Implementation tricks, not methodology
S02-R10	Train DeepSeek R1 from Scratch (GitHub)	https://github.com/FareedKhan-dev/train-deepseek-r1	Code repo, not analysis

The DeepSeek-R1 paper (S02-R01) is the primary source for the seminal production implementation of RLVR. The GRPO explainer (S02-R02) provides the algorithmic context needed to interpret it.