R0041/2026-04-01/Q003/S02

Research R0041 — Enterprise Sycophancy
Run 2026-04-01
Query Q003
Search S02

WebSearch — RLVR DeepSeek R1 GRPO training methodology

Summary

| Field | Value |
|---|---|
| Source/Database | WebSearch |
| Query terms | RLVR DeepSeek R1 GRPO training methodology sycophancy alignment |
| Filters | None |
| Results returned | 10 |
| Results selected | 2 |
| Results rejected | 8 |

Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R01 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs | https://arxiv.org/pdf/2501.12948 | Primary paper on DeepSeek R1's RLVR implementation |
| S02-R02 | Group Relative Policy Optimization (GRPO) | https://cameronrwolfe.substack.com/p/grpo | Technical deep-dive into GRPO algorithm |

Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S02-R03 | RL Guide (Unsloth) | https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide | Implementation guide, not analysis |
| S02-R04 | RLHF survey (arxiv) | https://arxiv.org/html/2504.12501v3 | Broad RLHF survey, not RLVR-specific |
| S02-R05 | Beyond Supervised Fine Tuning (Fireworks) | https://fireworks.ai/blog/reinforcement-learning-with-verifiable-reward | Vendor blog, covered by R01 |
| S02-R06 | State of RL for LLM Reasoning | https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training | Good overview but covered by R01 |
| S02-R07 | Technical Tour of DeepSeek Models | https://magazine.sebastianraschka.com/p/technical-deepseek | Broader than RLVR |
| S02-R08 | PPO and GRPO guide | https://yugeten.github.io/posts/2025/01/ppogrpo/ | Technical tutorial |
| S02-R09 | GRPO++ Tricks | https://cameronrwolfe.substack.com/p/grpo-tricks | Implementation tricks, not methodology |
| S02-R10 | Train DeepSeek R1 from Scratch (GitHub) | https://github.com/FareedKhan-dev/train-deepseek-r1 | Code repo, not analysis |

Notes

The DeepSeek-R1 paper (S02-R01) documents the seminal production-scale implementation of RLVR. The GRPO explainer (S02-R02) provides the algorithmic context needed to interpret that paper's training method.