R0041/2026-03-28/Q003/SRC01/E01¶
Comprehensive RLVR analysis: mechanism, domain applicability table, three critical failure modes, and comparison to RLHF/DPO.
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract¶
Mechanism: RLVR replaces learned reward models with programmatic verifiers providing deterministic binary feedback (1.0 correct, 0.0 incorrect). Training loop generates K candidate solutions per prompt, verifies each, updates policy to favor high-reward trajectories via GRPO.
Method comparison: RLHF uses human preferences (expensive, slow, subjective); DPO uses preference pairs (needs good pairs); RLVR uses programmatic checks (needs verifiers). "You trade generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks)."
Sampler vs. Thinker debate: Tsinghua research (April 2025) found RLVR primarily achieves "search compression" — models learn to select existing paths more efficiently, not develop new reasoning. Pass@k ceiling remains flat while pass@1 improves.
Three critical failure modes: (1) Partial verifiers create exploitable gaps (60% error detection leaves 40% exploit space). (2) Spurious rewards: Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching 29.1% from ground truth. (3) Entropy instability: in-distribution accuracy rises while out-of-distribution deteriorates.
Domain applicability: Works for math, code, Text2SQL, structured documents. Fails for creative writing, brand voice, nuanced argumentation, medical/legal decisions.
No sycophancy discussion: The article does not address sycophancy. RLVR's focus is search efficiency and verifiable correctness, not preference alignment behaviors.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | RLVR mechanism fundamentally differs from preference-based methods, bypassing sycophancy-inducing reward signals |
| H2 | Contradicts | RLVR does structurally avoid preference-based bias in domains where it applies |
| H3 | Supports | Domain limitations ("fails for creative writing, brand voice, nuanced argumentation") align with H3's prediction |
Context¶
The "sampler vs. thinker" debate is important context: even in domains where RLVR works, it may be optimizing selection among existing paths rather than creating new reasoning capability. This has implications for how deeply RLVR can address sycophancy even within its applicable domains.