R0040/2026-04-01/Q001/SRC03

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Search S03
Result S03-R01
Source SRC03

DeepSeek -- Group Relative Policy Optimization (GRPO)

Source

| Field | Value |
|---|---|
| Title | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |
| Publisher | arXiv (DeepSeek AI) |
| Author(s) | Zhihong Shao et al. |
| Date | 2024-02-05 |
| URL | https://arxiv.org/abs/2402.03300 |
| Type | Research paper |

Summary

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Some concerns |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | DeepSeek is a major AI lab. The paper is well cited and GRPO has been widely adopted. |
| Relevance | Directly introduces GRPO, a key RLHF alternative for reasoning models. |
| Bias flags | DeepSeek has a commercial interest in GRPO's success. Selective reporting is a concern because the chosen benchmarks may favor GRPO, although independent replications exist. |

Evidence Extracts

| Evidence ID | Summary |
|---|---|
| SRC03-E01 | GRPO eliminates the critic (value) network and instead scores each sampled output relative to the other outputs in its group. |
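
To make the group-relative scoring in SRC03-E01 concrete, here is a minimal sketch of how a per-output advantage could be computed without a critic: each reward is normalized against the mean and standard deviation of its own sampling group. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own group.

    rewards: scalar rewards for several outputs sampled from the SAME prompt.
    eps: small constant to avoid division by zero (an assumption of this sketch).
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption here; an implementation
    # could equally use the sample standard deviation.
    sigma = pstdev(rewards)
    # The group mean acts as the baseline that a learned critic would
    # otherwise have to provide.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With these hypothetical rewards, the two correct answers receive advantages close to +1 and the two incorrect ones close to -1. If every output in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.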