R0040/2026-04-01/Q001/SRC05/E01¶
RLVR replaces learned reward models with programmatic verifiers
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract¶
RLVR replaces learned reward models with programmatic verifiers that provide deterministic feedback. The training loop: generate multiple candidates, verify each with a programmatic check (binary reward), then update the policy to favor high-reward trajectories using GRPO.
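The loop above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: `verify` stands in for whatever programmatic check the task defines (here a hypothetical exact-match answer check), and the group-relative advantage normalization is the core idea behind GRPO.

```python
import statistics

def verify(candidate: str, expected: str) -> float:
    """Hypothetical programmatic verifier: binary reward via exact match."""
    return 1.0 if candidate.strip() == expected.strip() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# One training step: sample a group of candidates, verify, score.
candidates = ["42", "41", "42", "40"]   # stand-ins for sampled completions
rewards = [verify(c, "42") for c in candidates]
advantages = grpo_advantages(rewards)
# Candidates with positive advantage are the ones the policy update favors.
```

Because the reward is binary and deterministic, the same candidate always gets the same score, which is what distinguishes this from a learned reward model.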
Key differences from RLHF:

| Method | Signal | Best For | Limitation |
|--------|--------|----------|------------|
| RLHF | Human preferences | Subjective quality | Expensive, slow |
| RLVR | Programmatic checks | Verifiable tasks | Requires effective verifiers |
Central debate -- how do the gains break down?
- Majority view: search compression (improved pass@k-to-pass@1 efficiency)
- Minority view: true capability expansion (a lifted pass@k ceiling)
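The compression-vs-expansion distinction hinges on how pass@k is measured. A common way to estimate it (the combinatorial estimator popularized by code-generation evaluations; an assumption here, since the source does not specify its metric) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n generations (of which c are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Compression view: RLVR shifts probability mass so pass@1 approaches the
# base model's pass@k, without raising the pass@k ceiling itself.
base_pass_at_10 = pass_at_k(n=100, c=20, k=10)   # base model, many tries
tuned_pass_at_1 = pass_at_k(n=100, c=60, k=1)    # tuned model, one try
```

Under the compression reading, `tuned_pass_at_1` rises toward `base_pass_at_10` while large-k pass rates stay flat; under the expansion reading, large-k pass rates rise too.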
Critical failure modes:
1. Partial verifiers: models exploit gaps in incomplete verification
2. Spurious rewards: Qwen2.5-Math improved 21.4% with random rewards vs. 29.1% with ground truth
3. Entropy collapse: in-distribution gains come at the cost of out-of-distribution generalization
Relevance to Hypotheses¶
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Strength |
|---|---|---|
| Verifiable-reward RL | Supports | Primary evidence for RLVR as distinct paradigm |
| Limitations | Supports nuance | Gains may be compression rather than true capability |
| Task specificity | Supports | RLVR only works for verifiable tasks |
Context¶
RLVR is the most task-constrained alternative -- it only applies where objective correctness criteria exist (math, code, structured decisions). It cannot replace RLHF for open-ended tasks like dialogue or creative writing.