R0041/2026-04-01/Q003/SRC01/E02¶
RLVR failure modes and the "sampler vs. thinker" debate
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract¶
Three failure modes:
- Partial Verifiers: Models exploit verification gaps (see the sketch after this list). "A verifier catching 60% of errors creates a 40% gap. Models find and exploit these gaps."
- Spurious Rewards: "Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching the 29.1% gain from ground truth rewards." This suggests training dynamics drive gains independently of verifier quality.
- Entropy Collapse: "As GRPO training progresses and entropy declines, in-distribution test accuracy rises while out-of-distribution performance deteriorates." Mode collapse traps models in narrow reasoning patterns.
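The arithmetic behind the partial-verifier failure mode is worth making concrete. A minimal sketch with hypothetical numbers (only the 60%/40% split comes from the post): an "average" 60% error-catch rate can still be exploited because the misses are rarely uniform. If they concentrate on patterns the verifier cannot check, RL shifts probability mass onto exactly those patterns.

```python
# Minimal sketch (illustrative numbers; only the 60%/40% split is from
# the post) of how a policy can exploit a partial verifier.

def expected_reward(p_correct: float, p_pass_if_wrong: float) -> float:
    # Reward 1.0 if the answer is correct OR the verifier fails to flag it.
    return p_correct + (1.0 - p_correct) * p_pass_if_wrong

# Honest strategy: generic wrong answers, mostly caught by the verifier.
honest = expected_reward(p_correct=0.50, p_pass_if_wrong=0.10)

# Exploit strategy: outputs shaped to dodge the checker; rarely correct,
# but the verifier misses almost all of them.
exploit = expected_reward(p_correct=0.05, p_pass_if_wrong=0.95)

print(f"honest reward:  {honest:.3f}")   # ~0.55
print(f"exploit reward: {exploit:.3f}")  # ~0.95
```

Under this toy model, a policy that is correct only 5% of the time out-rewards one that is correct half the time, which is the reward-hacking dynamic the post describes.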
Sampler vs. Thinker debate: Recent research suggests "most RLVR gains come from sampling efficiency, with a smaller portion from true learning." The Tsinghua paper argues "RLVR-trained models generate paths already in the base model's distribution." Evidence: "pass@1 improves while pass@k ceiling stays flat," indicating compression rather than capability expansion. Analysis shows "71% compression versus minimal capability gain."
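The pass@1-vs-pass@k pattern is easy to reproduce in a toy model. A minimal sketch, assuming independent samples and hypothetical per-problem solve rates (none of these numbers come from the paper): RLVR is modeled as pure sharpening, raising solve rates only on problems the base model already solves with nonzero probability.

```python
# Minimal sketch (hypothetical solve rates, not measured data): if RLVR
# only sharpens the distribution over solutions the base model can
# already sample, pass@1 rises while the pass@k ceiling stays flat.

def pass_at_k(solve_rates, k):
    # P(at least one of k independent samples is correct), averaged
    # over problems.
    return sum(1 - (1 - p) ** k for p in solve_rates) / len(solve_rates)

base = [0.9, 0.3, 0.05, 0.02, 0.0, 0.0]  # broad but unreliable
rlvr = [1.0, 0.9, 0.6, 0.4, 0.0, 0.0]    # same solvable set, sharper

for k in (1, 8, 256):
    print(f"k={k:>3}  base: {pass_at_k(base, k):.3f}  "
          f"rlvr: {pass_at_k(rlvr, k):.3f}")
```

With these rates, pass@1 more than doubles while pass@256 converges to essentially the same ceiling for both models (4 of 6 problems, ~0.667), which is exactly the flat-ceiling signature the "sampler" interpretation predicts.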
Relevance to Hypotheses¶
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Contradicts | Failure modes and the "sampler" finding undermine RLVR as a general solution |
| H2 | Supports | Confirms RLVR is effective but limited, consistent with partial applicability |
| H3 | Supports | The "sampler" interpretation suggests RLVR may not create new capabilities |
Context¶
The spurious rewards finding is particularly significant: if random rewards produce nearly the same gains as ground-truth rewards, the improvement may be driven by the RL training dynamics themselves rather than by the verifier's signal, which would mean RLVR's mechanism is not what researchers generally assume.
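A minimal sketch of why a random reward still produces a training signal, assuming GRPO's standard group-normalized advantage (everything here is a toy construction, not the Qwen2.5-Math setup): within a rollout group, advantages are nonzero whenever rewards differ, so updates proceed regardless of whether the reward tracks correctness.

```python
# Minimal sketch (toy numbers): GRPO's group-relative advantage does not
# vanish under random rewards, so the policy keeps receiving updates
# even with no verifier-aligned signal.
import random
import statistics

random.seed(0)

def grpo_advantages(rewards):
    # GRPO's group-relative advantage: normalize each rollout's reward
    # against the mean and std of its own group.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against an all-tie group
    return [(r - mu) / sigma for r in rewards]

# Eight rollouts for one prompt, "verified" by a coin flip instead of a
# ground-truth checker.
random_rewards = [random.choice([0.0, 1.0]) for _ in range(8)]
advs = grpo_advantages(random_rewards)

print("rewards:   ", random_rewards)
print("advantages:", [f"{a:+.2f}" for a in advs])
# Unless the whole group ties, the advantages are nonzero: rollouts that
# happened to draw reward 1 are reinforced and the rest suppressed, at
# random, so training dynamics alone can reshape the policy.
```

One hypothesis advanced in follow-up analyses of the spurious-rewards result is that these verifier-agnostic updates, filtered through PPO-style clipping, preferentially amplify behaviors the base model already samples often, which would make the gains model-specific rather than verifier-driven.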