Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q001
Search S03

WebSearch — GRPO, KTO, ORPO, and RLVR as additional RLHF alternatives

Summary

| Field | Value |
|---|---|
| Source/Database | WebSearch (4 queries combined) |
| Query terms | "GRPO group relative policy optimization DeepSeek 2025 2026"; "KTO kahneman tversky optimization DPO alternative 2024 2025 alignment"; "SPIN self-play fine-tuning ORPO odds ratio preference optimization 2024 LLM alignment"; "reinforcement learning verifiable rewards RLVR reasoning models 2025 2026" |
| Filters | None |
| Results returned | 40 (10 per query) |
| Results selected | 5 |
| Results rejected | 35 |

Selected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S03-R01 | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | https://arxiv.org/abs/2402.03300 | Original GRPO paper — primary source |
| S03-R02 | KTO: Model Alignment as Prospect Theoretic Optimization | https://arxiv.org/abs/2402.01306 | Original KTO paper — ICML 2024 |
| S03-R03 | ORPO: Monolithic Preference Optimization without Reference Model | https://arxiv.org/abs/2403.07691 | Original ORPO paper |
| S03-R04 | RL with Verifiable Rewards Implicitly Incentivizes Correct Reasoning | https://arxiv.org/abs/2506.14245 | Key RLVR paper on reasoning |
| S03-R05 | GRPO (Cameron Wolfe deep dive) | https://cameronrwolfe.substack.com/p/grpo | Detailed technical analysis of GRPO |

Rejected Results

| Result | Title | URL | Rationale |
|---|---|---|---|
| S03-R06 | Various tutorial, blog, and secondary sources | Multiple URLs | 35 results from four searches: tutorials, blog posts, and secondary analyses that were redundant with the primary papers selected above |

Notes

Four separate searches were executed to cover the breadth of newer RLHF alternatives (GRPO, KTO, ORPO, RLVR). In each case, the original academic paper was prioritized over secondary coverage. The 35 rejected results are consolidated into a single entry because they share the same rejection rationale: secondary coverage of methods already captured by the primary papers.
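For context on why the GRPO paper (S03-R01) is flagged as the primary source: GRPO's distinguishing idea is replacing PPO's learned critic with a group-relative baseline, where each sampled response's reward is normalized against the mean and standard deviation of the other responses drawn for the same prompt. A minimal sketch of that advantage computation follows; the function and variable names are illustrative, not taken from the paper's code.

```python
# Sketch of GRPO's group-relative advantage (DeepSeekMath, arXiv:2402.03300).
# Instead of a value model, the baseline is the mean reward of the group of
# responses sampled for one prompt; names here are illustrative assumptions.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each response = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:  # all responses scored identically -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one prompt, scored 1.0 if correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, incorrect ones negative.
```

This pairs naturally with the RLVR setting (S03-R04), where the per-response reward comes from an automatic verifier rather than a learned reward model.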