SRC03

DeepSeek -- Group Relative Policy Optimization (GRPO)

Source

Field	Value
Title	DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Publisher	arXiv (DeepSeek AI)
Author(s)	Zhihong Shao et al.
Date	2024-02-05
URL	https://arxiv.org/abs/2402.03300
Type	Research paper

Dimension	Rationale
Reliability	DeepSeek is a major AI lab. Paper is well-cited and GRPO has been widely adopted.
Relevance	Directly introduces GRPO, a PPO variant that drops the critic (value) model and is widely used for RL training of reasoning models.
Bias flags	DeepSeek has commercial interest in GRPO's success. Selective reporting concern: benchmarks chosen may favor GRPO. However, independent replications exist.

Evidence ID	Summary
SRC03-E01	GRPO eliminates the critic (value) network and instead scores each sampled output relative to the other outputs in its group for the same prompt.
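
The group-relative scoring in SRC03-E01 can be illustrated with a minimal sketch. It assumes the outcome-reward setting, where each of the outputs sampled for one prompt gets a single scalar reward and its advantage is that reward normalized by the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` guard against zero variance, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards into advantages (hypothetical helper).

    `rewards` holds one scalar reward per sampled output for a single prompt.
    The group mean serves as the baseline, so no learned critic is needed.
    `eps` is an assumed safeguard for groups where every reward is identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two judged correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, outputs that score above the group mean receive positive advantages and those below receive negative ones; a group whose rewards are all equal yields zero advantage for every output.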