R0040/2026-03-28/Q001/SRC04

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Search S03
Result S03-R01
Source SRC04

DeepSeek paper introducing Group Relative Policy Optimization (GRPO), a critic-free variant of PPO, for mathematical reasoning.

Source

| Field | Value |
|---|---|
| Title | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |
| Publisher | DeepSeek |
| Author(s) | Zhihong Shao et al. |
| Date | 2024-02-05 |
| URL | https://arxiv.org/abs/2402.03300 |
| Type | Research paper |

Summary

| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Some concerns |

Rationale

| Dimension | Rationale |
|---|---|
| Reliability | Published by a major AI lab with an open-source track record. GRPO was subsequently validated through its use in DeepSeek-R1. |
| Relevance | Introduces a structurally different RL alternative to PPO-based RLHF that eliminates the critic (value) model. |
| Bias flags | COI: DeepSeek developed GRPO for its own models. Selective reporting: the claimed compute savings (~50%) have not been independently replicated. |

Evidence Extracts

| Evidence ID | Summary |
|---|---|
| SRC04-E01 | GRPO eliminates the critic model, reportedly cutting compute requirements by about half relative to PPO (claim not independently replicated). |
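
To make the "eliminates the critic model" point concrete, here is a minimal sketch of a group-relative advantage computation: several responses are sampled for the same prompt, and each response's reward is normalized by the group's mean and standard deviation instead of being compared to a learned value estimate. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    The group mean plays the role of the baseline that a critic
    would otherwise have to predict. `eps` (an assumption) guards
    against division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive values, incorrect ones negative
```

Because the baseline comes from the sampled group itself, no separate value network has to be trained or kept in memory, which is where the claimed resource savings would originate. The sketch covers only the advantage step; the clipped policy objective and KL penalty are omitted.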