R0040/2026-03-28/Q001/S03
WebSearch — GRPO, KTO, ORPO, and RLVR as additional RLHF alternatives
Summary
| Field | Value |
| --- | --- |
| Source/Database | WebSearch (4 queries combined) |
| Query terms | "GRPO group relative policy optimization DeepSeek 2025 2026"; "KTO kahneman tversky optimization DPO alternative 2024 2025 alignment"; "SPIN self-play fine-tuning ORPO odds ratio preference optimization 2024 LLM alignment"; "reinforcement learning verifiable rewards RLVR reasoning models 2025 2026" |
| Filters | None |
| Results returned | 40 (10 per query) |
| Results selected | 5 |
| Results rejected | 35 |
Selected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R01 | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | https://arxiv.org/abs/2402.03300 | Original GRPO paper — primary source (core advantage computation sketched after this table) |
| S03-R02 | KTO: Model Alignment as Prospect Theoretic Optimization | https://arxiv.org/abs/2402.01306 | Original KTO paper — ICML 2024 |
| S03-R03 | ORPO: Monolithic Preference Optimization without Reference Model | https://arxiv.org/abs/2403.07691 | Original ORPO paper |
| S03-R04 | RL with Verifiable Rewards Implicitly Incentivizes Correct Reasoning | https://arxiv.org/abs/2506.14245 | Key RLVR paper on reasoning |
| S03-R05 | GRPO (Cameron Wolfe deep dive) | https://cameronrwolfe.substack.com/p/grpo | Detailed technical analysis of GRPO |
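For reference, a minimal Python sketch of the group-relative advantage at the heart of GRPO as described in S03-R01 and S03-R05: rewards for a group of completions sampled from the same prompt are normalized by the group's mean and standard deviation, replacing the learned value model used in PPO. The function name and toy reward values are illustrative, not taken from either source.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages: each reward is normalized against
    the mean and standard deviation of its sampling group, so no
    separate value network is needed."""
    mean = statistics.mean(rewards)
    # stdev needs >= 2 samples; also guard against a zero-variance group.
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

# Toy example: reward-model scores for four completions of one prompt.
# The group mean is 0.5, so the advantages are centered on zero.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
```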
Rejected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R06 | Various tutorial, blog, and secondary sources | Multiple URLs | 35 results from four searches: tutorials, blog posts, and secondary analyses that were redundant with the primary papers selected above |
Notes
Four separate searches were executed to cover the breadth of newer RLHF alternatives (GRPO, KTO, ORPO, RLVR). In each case the original academic paper was prioritized over secondary coverage; one secondary source (S03-R05) was retained because it gives an unusually detailed technical treatment of GRPO. The 35 rejected results are consolidated into a single entry because they share the same rejection rationale: secondary coverage of methods already captured by the primary papers.
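To make the "verifiable rewards" term concrete, a minimal sketch of a rule-based reward in the spirit of RLVR (S03-R04): the reward is a deterministic check against a ground-truth answer rather than the output of a learned reward model. The exact-match normalization here is a toy stand-in; real pipelines use task-specific verifiers such as math-answer checkers or unit tests.

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward in the RLVR style: a deterministic check against
    ground truth instead of a learned reward model. Exact match after
    whitespace/case normalization is a toy verifier."""
    same = model_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0

print(verifiable_reward(" 42 ", "42"))  # 1.0 (passes verification)
print(verifiable_reward("41", "42"))    # 0.0 (fails verification)
```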