R0040/2026-04-01/Q001/SRC05/E01¶
RLVR replaces learned reward models with programmatic verifiers
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract¶
RLVR replaces learned reward models with programmatic verifiers that provide deterministic feedback. The training loop: generate multiple candidates, verify each with a programmatic check (binary reward), then update the policy to favor high-reward trajectories using GRPO.
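The loop above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: `verify` stands in for whatever programmatic check the task defines (here a hypothetical exact-match answer check), and the group-relative advantage normalization is the core idea behind GRPO.

```python
import statistics

def verify(candidate: str, expected: str) -> float:
    """Hypothetical programmatic verifier: binary reward via exact match."""
    return 1.0 if candidate.strip() == expected.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# One training step: sample a group of candidates, verify, score.
candidates = ["42", "41", "42", "40"]   # stand-ins for sampled completions
rewards = [verify(c, "42") for c in candidates]
advantages = grpo_advantages(rewards)
# Candidates with positive advantage are the ones the policy update favors.
```

Because the reward is binary and deterministic, the same candidate always gets the same score, which is what distinguishes this from a learned reward model.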
Key differences from RLHF:

| Method | Signal | Best For | Limitation |
|--------|--------|----------|------------|
| RLHF | Human preferences | Subjective quality | Expensive, slow |
| RLVR | Programmatic checks | Verifiable tasks | Requires effective verifiers |
Central debate -- how do the gains break down?
- Majority view: search compression (improved pass@k-to-pass@1 efficiency)
- Minority view: true capability expansion (a lifted pass@k ceiling)
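The compression-vs-expansion distinction hinges on how pass@k is measured. A common way to estimate it (the combinatorial estimator popularized by code-generation evaluations; an assumption here, since the source does not specify its metric) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n generations (of which c are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Compression view: RLVR shifts probability mass so pass@1 approaches the
# base model's pass@k, without raising the pass@k ceiling itself.
base_pass_at_10 = pass_at_k(n=100, c=20, k=10)   # base model, many tries
tuned_pass_at_1 = pass_at_k(n=100, c=60, k=1)    # tuned model, one try
```

Under the compression reading, `tuned_pass_at_1` rises toward `base_pass_at_10` while large-k pass rates stay flat; under the expansion reading, large-k pass rates rise too.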
Critical failure modes:
1. Partial verifiers: models exploit gaps in incomplete verification
2. Spurious rewards: Qwen2.5-Math improved 21.4% with random rewards vs. 29.1% with ground truth
3. Entropy collapse: in-distribution gains come at the cost of out-of-distribution generalization
Relevance to Hypotheses¶
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Strength |
|---|---|---|
| Verifiable-reward RL | Supports | Primary evidence for RLVR as distinct paradigm |
| Limitations | Supports nuance | Gains may be compression rather than true capability |
| Task specificity | Supports | RLVR only works for verifiable tasks |
Context¶
RLVR is the most task-constrained alternative -- it only applies where objective correctness criteria exist (math, code, structured decisions). It cannot replace RLHF for open-ended tasks like dialogue or creative writing.