R0041/2026-04-01/Q003/SRC01/E02¶
RLVR failure modes and the "sampler vs. thinker" debate
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract¶
Three failure modes:
- Partial Verifiers: Models exploit verification gaps (see the sketch after this list). "A verifier catching 60% of errors creates a 40% gap. Models find and exploit these gaps."
- Spurious Rewards: "Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching the 29.1% gain from ground truth rewards." This suggests training dynamics drive gains independently of verifier quality.
- Entropy Collapse: "As GRPO training progresses and entropy declines, in-distribution test accuracy rises while out-of-distribution performance deteriorates." Mode collapse traps models in narrow reasoning patterns.
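The arithmetic behind the partial-verifier failure mode is worth making concrete. A minimal sketch with hypothetical numbers (only the 60%/40% split comes from the post): an "average" 60% error-catch rate can still be exploited because the misses are rarely uniform. If they concentrate on patterns the verifier cannot check, RL shifts probability mass onto exactly those patterns.

```python
# Minimal sketch (illustrative numbers; only the 60%/40% split is from
# the post) of how a policy can exploit a partial verifier.

def expected_reward(p_correct: float, p_pass_if_wrong: float) -> float:
    # Reward 1.0 if the answer is correct OR the verifier fails to flag it.
    return p_correct + (1.0 - p_correct) * p_pass_if_wrong

# Honest strategy: generic wrong answers, mostly caught by the verifier.
honest = expected_reward(p_correct=0.50, p_pass_if_wrong=0.10)

# Exploit strategy: outputs shaped to dodge the checker; rarely correct,
# but the verifier misses almost all of them.
exploit = expected_reward(p_correct=0.05, p_pass_if_wrong=0.95)

print(f"honest reward:  {honest:.3f}")   # ~0.55
print(f"exploit reward: {exploit:.3f}")  # ~0.95
```

Under this toy model, a policy that is correct only 5% of the time out-rewards one that is correct half the time, which is the reward-hacking dynamic the post describes.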
Sampler vs. Thinker debate: Recent research suggests "most RLVR gains come from sampling efficiency, with a smaller portion from true learning." The Tsinghua paper argues "RLVR-trained models generate paths already in the base model's distribution." Evidence: "pass@1 improves while pass@k ceiling stays flat," indicating compression rather than capability expansion. Analysis shows "71% compression versus minimal capability gain."
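The pass@1-vs-pass@k pattern is easy to reproduce in a toy model. A minimal sketch, assuming independent samples and hypothetical per-problem solve rates (none of these numbers come from the paper): RLVR is modeled as pure sharpening, raising solve rates only on problems the base model already solves with nonzero probability.

```python
# Minimal sketch (hypothetical solve rates, not measured data): if RLVR
# only sharpens the distribution over solutions the base model can
# already sample, pass@1 rises while the pass@k ceiling stays flat.

def pass_at_k(solve_rates, k):
    # P(at least one of k independent samples is correct), averaged
    # over problems.
    return sum(1 - (1 - p) ** k for p in solve_rates) / len(solve_rates)

base = [0.9, 0.3, 0.05, 0.02, 0.0, 0.0]  # broad but unreliable
rlvr = [1.0, 0.9, 0.6, 0.4, 0.0, 0.0]    # same solvable set, sharper

for k in (1, 8, 256):
    print(f"k={k:>3}  base: {pass_at_k(base, k):.3f}  "
          f"rlvr: {pass_at_k(rlvr, k):.3f}")
```

With these rates, pass@1 more than doubles while pass@256 converges to essentially the same ceiling for both models (4 of 6 problems, ~0.667), which is exactly the flat-ceiling signature the "sampler" interpretation predicts.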
Relevance to Hypotheses¶
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Contradicts | Failure modes and the "sampler" finding undermine RLVR as a general solution |
| H2 | Supports | Confirms RLVR is effective but limited, consistent with partial applicability |
| H3 | Supports | The "sampler" interpretation suggests RLVR may not create new capabilities |
Context¶
The spurious rewards finding is particularly significant: if random rewards produce nearly the same gains as ground-truth rewards, the improvement may be driven by the RL training dynamics themselves rather than by the verifier's signal, which would mean RLVR's mechanism is not what researchers generally assume.
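A minimal sketch of why a random reward still produces a training signal, assuming GRPO's standard group-normalized advantage (everything here is a toy construction, not the Qwen2.5-Math setup): within a rollout group, advantages are nonzero whenever rewards differ, so updates proceed regardless of whether the reward tracks correctness.

```python
# Minimal sketch (toy numbers): GRPO's group-relative advantage does not
# vanish under random rewards, so the policy keeps receiving updates
# even with no verifier-aligned signal.
import random
import statistics

random.seed(0)

def grpo_advantages(rewards):
    # GRPO's group-relative advantage: normalize each rollout's reward
    # against the mean and std of its own group.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against an all-tie group
    return [(r - mu) / sigma for r in rewards]

# Eight rollouts for one prompt, "verified" by a coin flip instead of a
# ground-truth checker.
random_rewards = [random.choice([0.0, 1.0]) for _ in range(8)]
advs = grpo_advantages(random_rewards)

print("rewards:   ", random_rewards)
print("advantages:", [f"{a:+.2f}" for a in advs])
# Unless the whole group ties, the advantages are nonzero: rollouts that
# happened to draw reward 1 are reinforced and the rest suppressed, at
# random, so training dynamics alone can reshape the policy.
```

One hypothesis advanced in follow-up analyses of the spurious-rewards result is that these verifier-agnostic updates, filtered through PPO-style clipping, preferentially amplify behaviors the base model already samples often, which would make the gains model-specific rather than verifier-driven.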