R0041/2026-04-01/Q003/S02
WebSearch — RLVR DeepSeek R1 GRPO training methodology
Summary
| Field | Value |
|---|---|
| Source/Database | WebSearch |
| Query terms | RLVR DeepSeek R1 GRPO training methodology sycophancy alignment |
| Filters | None |
| Results returned | 10 |
| Results selected | 2 |
| Results rejected | 8 |
Selected Results
| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs | https://arxiv.org/pdf/2501.12948 | Primary paper on DeepSeek R1's RLVR implementation |
| S02-R02 | Group Relative Policy Optimization (GRPO) | https://cameronrwolfe.substack.com/p/grpo | Technical deep-dive into GRPO algorithm |
Rejected Results
Notes
The DeepSeek-R1 paper (S02-R01) is the primary source for a production-scale implementation of RLVR. The GRPO explainer (S02-R02) provides the algorithmic context needed to interpret it.