R0041/2026-04-01/Q003/SRC01/E02

Research R0041 — Enterprise Sycophancy
Run 2026-04-01
Query Q003
Source SRC01
Evidence SRC01-E02
Type Analytical

RLVR failure modes and the "sampler vs. thinker" debate

URL: https://www.promptfoo.dev/blog/rlvr-explained/

Extract

Three failure modes:

  1. Partial Verifiers: Models exploit verification gaps. "A verifier catching 60% of errors creates a 40% gap. Models find and exploit these gaps."
  2. Spurious Rewards: "Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching the 29.1% gain from ground truth rewards." This suggests that training dynamics, rather than verifier quality, drive much of the gain.
  3. Entropy Collapse: "As GRPO training progresses and entropy declines, in-distribution test accuracy rises while out-of-distribution performance deteriorates." Mode collapse traps models in narrow reasoning patterns.
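The first failure mode can be made concrete with a toy simulation. This is an illustrative sketch, not from the source: it assumes a verifier that catches 60% of incorrect answers (the figure quoted above) and measures the reward a deliberately wrong "exploit" policy still collects from the gap.

```python
import random

random.seed(0)

CATCH_RATE = 0.6  # verifier catches 60% of incorrect answers (from the extract)

def verifier_reward(correct: bool) -> int:
    """Reward 1 if the verifier accepts the answer.

    Correct answers always pass; incorrect ones slip through the
    verification gap (1 - CATCH_RATE) of the time.
    """
    if correct:
        return 1
    return 1 if random.random() > CATCH_RATE else 0

# A policy that only emits wrong answers still collects reward on
# roughly the 40% verification gap -- exactly the gap RL can optimize into.
trials = 100_000
exploit_reward = sum(verifier_reward(correct=False) for _ in range(trials)) / trials
print(round(exploit_reward, 2))  # ≈ 0.40
```

The point of the sketch: any gap between "verified correct" and "actually correct" is free reward from the optimizer's perspective, so the policy gradient pushes probability mass toward it.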

Sampler vs. Thinker debate: Recent research suggests "most RLVR gains come from sampling efficiency, with a smaller portion from true learning." The Tsinghua paper argues "RLVR-trained models generate paths already in the base model's distribution." Evidence: "pass@1 improves while pass@k ceiling stays flat", indicating compression rather than capability expansion. Analysis shows "71% compression versus minimal capability gain."
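The pass@1-up / pass@k-flat signature can be reproduced with the standard unbiased pass@k estimator and hypothetical per-problem success counts (the counts below are invented for illustration, not from the source): sharpening the sampling distribution raises the per-sample hit rate without changing which problems are solvable at all.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given c correct out of n total samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts out of n=100 samples on four problems.
# The RLVR model samples correct paths more often (sampling efficiency),
# but the set of solvable problems is identical (problem 3 stays at 0).
base     = [5, 2, 0, 60]
rl_tuned = [40, 30, 0, 95]

for name, counts in [("base", base), ("rlvr", rl_tuned)]:
    p1 = sum(pass_at_k(100, c, 1) for c in counts) / len(counts)
    p100 = sum(pass_at_k(100, c, 100) for c in counts) / len(counts)
    print(f"{name}: pass@1={p1:.2f}  pass@100={p100:.2f}")
```

Running this prints pass@1 rising from 0.17 to 0.41 while pass@100 stays at 0.75 for both models: the "compression, not capability expansion" pattern the extract describes.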

Relevance to Hypotheses

Hypothesis | Relationship | Rationale
H1 | Contradicts | Failure modes and the "sampler" finding undermine RLVR as a general solution
H2 | Supports | Confirms RLVR is effective but limited, consistent with partial applicability
H3 | Supports | The "sampler" interpretation suggests RLVR may not create new capabilities

Context

The spurious rewards finding is particularly significant: if random rewards produce nearly the same gains as correct rewards, RLVR's mechanism may not be what researchers think it is.