R01


Research	R0040 — RLHF Alternatives
Run	2026-03-28
Query	Q001
Search	S03
Result	S03-R01

Original DeepSeekMath paper introducing GRPO.

Summary

Field	Value
Title	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
URL	https://arxiv.org/abs/2402.03300
Date accessed	2026-03-28
Publication date	2024-02-05
Author(s)	Zhihong Shao et al. (DeepSeek)
Publication	arXiv

Selection Decision

Included in evidence base: Yes

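Rationale: Primary source for GRPO (Group Relative Policy Optimization), which was subsequently adopted in DeepSeek-R1, one of the most prominent reasoning models. Demonstrates an approximately 50% compute reduction relative to PPO, attributable to GRPO dropping PPO's separate value (critic) model.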
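
Illustrative note: the compute saving comes from GRPO replacing PPO's learned value model with a baseline computed from a group of sampled responses to the same prompt. The sketch below shows only that group-relative normalization step. It is not the paper's implementation: the function name, the `eps` term, the use of the population standard deviation, and the sample rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    The group mean acts as the baseline, so no critic model is needed.
    eps (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to a single prompt
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones. In the full method these advantages feed a clipped policy-gradient objective with a KL penalty, which this sketch omits.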