R0041/2026-03-28/Q003/SRC04/E01
DeepSeek-R1 uses RLVR with rule-based rewards for math and code domains, with acknowledged limitations in broader areas.
URL: https://arxiv.org/abs/2501.12948
Extract
DeepSeek-R1-Zero employs rule-based rewards to deliver precise feedback for mathematical, coding, and logical reasoning domains. For math problems with deterministic results, the model provides final answers in a specified format, enabling reliable rule-based verification. The RL framework achieves superior performance on verifiable tasks such as mathematics and coding competitions. However, the paper acknowledges: "the rule-based RL training stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited performance in broader areas such as writing and open-domain question answering." The Group Relative Policy Optimization (GRPO) algorithm is used for training.
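The verification-and-advantage loop described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: it assumes the model emits its final answer in a `\boxed{...}` wrapper (one common convention for deterministic math answers) and that GRPO-style advantages are computed by normalizing each sampled completion's reward against its group's mean and standard deviation. The function names and the exact matching rule are hypothetical.

```python
import re
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the final answer inside
    \\boxed{...} exactly matches the gold answer, else 0.0.
    No learned reward model is involved, so there is no preference bias."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # unverifiable format -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage sketch: center and scale each reward
    by the group's mean and (population) std, as GRPO does per sampled
    group rather than via a learned value/critic network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mean) / std for r in rewards]

# A group of 4 sampled completions for one math prompt (gold answer "42").
group = [
    r"... so the result is \boxed{42}",
    r"... therefore \boxed{41}",
    "no boxed answer given",
    r"... hence \boxed{42}",
]
rewards = [rule_based_reward(c, "42") for c in group]
advantages = grpo_advantages(rewards)
```

Completions that verify get positive advantage and are reinforced; the rest are pushed down, all relative to the group rather than to an absolute baseline.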
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates functional RLVR in math/code, where verifiable rule-based rewards bypass preference-model bias |
| H2 | Contradicts | A working RLVR implementation demonstrates the approach is functional |
| H3 | Supports | The paper itself acknowledges "limited performance in broader areas," confirming narrow applicability |
Context
DeepSeek-R1 is the paper that catalyzed the current wave of RLVR research; its January 2025 publication prompted extensive follow-up work on RLVR's potential and limitations.