R0040/2026-03-28/Q001/S03
WebSearch — GRPO, KTO, ORPO, and RLVR as additional RLHF alternatives
Summary
| Field | Value |
| --- | --- |
| Source/Database | WebSearch (4 queries combined) |
| Query terms | "GRPO group relative policy optimization DeepSeek 2025 2026"; "KTO kahneman tversky optimization DPO alternative 2024 2025 alignment"; "SPIN self-play fine-tuning ORPO odds ratio preference optimization 2024 LLM alignment"; "reinforcement learning verifiable rewards RLVR reasoning models 2025 2026" |
| Filters | None |
| Results returned | 40 (10 per query) |
| Results selected | 5 |
| Results rejected | 35 |
Selected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R01 | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | https://arxiv.org/abs/2402.03300 | Original GRPO paper — primary source (core advantage computation sketched after this table) |
| S03-R02 | KTO: Model Alignment as Prospect Theoretic Optimization | https://arxiv.org/abs/2402.01306 | Original KTO paper — ICML 2024 |
| S03-R03 | ORPO: Monolithic Preference Optimization without Reference Model | https://arxiv.org/abs/2403.07691 | Original ORPO paper |
| S03-R04 | RL with Verifiable Rewards Implicitly Incentivizes Correct Reasoning | https://arxiv.org/abs/2506.14245 | Key RLVR paper on reasoning |
| S03-R05 | GRPO (Cameron Wolfe deep dive) | https://cameronrwolfe.substack.com/p/grpo | Detailed technical analysis of GRPO |
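For reference, a minimal Python sketch of the group-relative advantage at the heart of GRPO as described in S03-R01 and S03-R05: rewards for a group of completions sampled from the same prompt are normalized by the group's mean and standard deviation, replacing the learned value model used in PPO. The function name and toy reward values are illustrative, not taken from either source.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Compute GRPO-style advantages: each reward is normalized against
    the mean and standard deviation of its sampling group, so no
    separate value network is needed."""
    mean = statistics.mean(rewards)
    # stdev needs >= 2 samples; also guard against a zero-variance group.
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

# Toy example: reward-model scores for four completions of one prompt.
# The group mean is 0.5, so the advantages are centered on zero.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
```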
Rejected Results
| Result | Title | URL | Rationale |
| --- | --- | --- | --- |
| S03-R06 | Various tutorial, blog, and secondary sources | Multiple URLs | 35 results from four searches: tutorials, blog posts, and secondary analyses that were redundant with the primary papers selected above |
Notes
Four separate searches were executed to cover the breadth of newer RLHF alternatives (GRPO, KTO, ORPO, RLVR). In each case the original academic paper was prioritized over secondary coverage; one secondary source (S03-R05) was retained because it gives an unusually detailed technical treatment of GRPO. The 35 rejected results are consolidated into a single entry because they share the same rejection rationale: secondary coverage of methods already captured by the primary papers.
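To make the "verifiable rewards" term concrete, a minimal sketch of a rule-based reward in the spirit of RLVR (S03-R04): the reward is a deterministic check against a ground-truth answer rather than the output of a learned reward model. The exact-match normalization here is a toy stand-in; real pipelines use task-specific verifiers such as math-answer checkers or unit tests.

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward in the RLVR style: a deterministic check against
    ground truth instead of a learned reward model. Exact match after
    whitespace/case normalization is a toy verifier."""
    same = model_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0

print(verifiable_reward(" 42 ", "42"))  # 1.0 (passes verification)
print(verifiable_reward("41", "42"))    # 0.0 (fails verification)
```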