R0040/2026-04-01/Q001/SRC03
DeepSeek -- Group Relative Policy Optimization (GRPO)
Source
| Field | Value |
|---|---|
| Title | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |
| Publisher | arXiv (DeepSeek AI) |
| Author(s) | Zhihong Shao et al. |
| Date | 2024-02-05 |
| URL | https://arxiv.org/abs/2402.03300 |
| Type | Research paper |
Summary
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Some concerns |
Rationale
| Dimension | Rationale |
|---|---|
| Reliability | DeepSeek is a major AI lab; the paper is well cited, and GRPO has been widely adopted. |
| Relevance | Directly introduces GRPO, a key RLHF alternative for reasoning models. |
| Bias flags | DeepSeek has a commercial interest in GRPO's success, and the reported benchmarks may have been chosen to favor it (selective reporting). However, independent replications exist. |
| Evidence ID | Summary |
|---|---|
| SRC03-E01 | GRPO eliminates the critic network and uses group-relative scoring |
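The evidence item above can be made concrete: GRPO replaces the learned critic/value network of PPO with an advantage computed by normalizing each response's reward within its sampled group. A minimal sketch of that group-relative normalization, written here as an illustration rather than the authors' reference code (the function name and epsilon term are ours):

```python
# Illustrative sketch of GRPO's group-relative advantage.
# For one prompt, G responses are sampled and scored by a reward model;
# each response's advantage is its reward standardized within the group,
# which removes the need for a separate critic network.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize per-response rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored 0..1.
advs = group_relative_advantages([1.0, 0.0, 0.5, 1.0])
```

Responses scoring above the group mean get positive advantages and those below get negative ones, so the group itself serves as the baseline.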