R0041/2026-03-28/Q003/SRC04/E01
DeepSeek-R1 uses RLVR with rule-based rewards for math and code domains, with acknowledged limitations in broader areas.
URL: https://arxiv.org/abs/2501.12948
Extract
DeepSeek-R1-Zero employs rule-based rewards to deliver precise feedback for mathematical, coding, and logical reasoning domains. For math problems with deterministic results, the model provides final answers in a specified format, enabling reliable rule-based verification. The RL framework achieves superior performance on verifiable tasks such as mathematics and coding competitions. However, the paper acknowledges: "the rule-based RL training stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited performance in broader areas such as writing and open-domain question answering." The Group Relative Policy Optimization (GRPO) algorithm is used for training.
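The verification-and-advantage loop described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: it assumes the model emits its final answer in a `\boxed{...}` wrapper (one common convention for deterministic math answers) and that GRPO-style advantages are computed by normalizing each sampled completion's reward against its group's mean and standard deviation. The function names and the exact matching rule are hypothetical.

```python
import re
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the final answer inside
    \\boxed{...} exactly matches the gold answer, else 0.0.
    No learned reward model is involved, so there is no preference bias."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # unverifiable format -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage sketch: center and scale each reward
    by the group's mean and (population) std, as GRPO does per sampled
    group rather than via a learned value/critic network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mean) / std for r in rewards]

# A group of 4 sampled completions for one math prompt (gold answer "42").
group = [
    r"... so the result is \boxed{42}",
    r"... therefore \boxed{41}",
    "no boxed answer given",
    r"... hence \boxed{42}",
]
rewards = [rule_based_reward(c, "42") for c in group]
advantages = grpo_advantages(rewards)
```

Completions that verify get positive advantage and are reinforced; the rest are pushed down, all relative to the group rather than to an absolute baseline.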
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates functional RLVR in math/code, where verifiable rule-based rewards bypass preference-model bias |
| H2 | Contradicts | A working RLVR implementation demonstrates the approach is functional |
| H3 | Supports | The paper itself acknowledges "limited performance in broader areas," confirming narrow applicability |
Context
DeepSeek-R1 is the paper that catalyzed the current wave of RLVR research; its January 2025 publication prompted extensive follow-up work on RLVR's potential and limitations.