R0041/2026-03-28/Q003/SRC01/E01

Research R0041 — Enterprise Sycophancy
Run 2026-03-28
Query Q003
Source SRC01
Evidence SRC01-E01
Type Analytical

Comprehensive RLVR analysis: mechanism, domain applicability table, three critical failure modes, and comparison to RLHF/DPO.

URL: https://www.promptfoo.dev/blog/rlvr-explained/

Extract

Mechanism: RLVR replaces learned reward models with programmatic verifiers that provide deterministic binary feedback (1.0 correct, 0.0 incorrect). The training loop generates K candidate solutions per prompt, verifies each, and updates the policy to favor high-reward trajectories via GRPO (Group Relative Policy Optimization).
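The loop described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: `policy_sample` and `verify` are hypothetical callables standing in for the model and the programmatic checker, and the GRPO update itself is reduced to its core idea of group-relative advantage normalization.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO's core idea: normalize each candidate's reward against the
    mean/std of its own group, so no learned value model is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

def rlvr_step(prompt, policy_sample, verify, k=8):
    """One RLVR iteration for a single prompt: sample K candidates, score
    each with a deterministic binary verifier (1.0 / 0.0), and return
    (candidate, advantage) pairs for the policy update."""
    candidates = [policy_sample(prompt) for _ in range(k)]
    rewards = [1.0 if verify(prompt, c) else 0.0 for c in candidates]
    return list(zip(candidates, grpo_advantages(rewards)))
```

Note that when every candidate in a group passes (or every one fails), all advantages collapse to zero and the prompt contributes no gradient, which is why RLVR sampling is typically concentrated on prompts of intermediate difficulty.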

Method comparison: RLHF uses human preferences (expensive, slow, subjective); DPO uses preference pairs (needs good pairs); RLVR uses programmatic checks (needs verifiers). "You trade generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks)."

Sampler vs. Thinker debate: Tsinghua research (April 2025) found RLVR primarily achieves "search compression" — models learn to select existing paths more efficiently, not develop new reasoning. Pass@k ceiling remains flat while pass@1 improves.
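The pass@1 vs. pass@k distinction is worth making concrete. The standard unbiased estimator from the code-generation literature computes the probability that at least one of k samples (drawn from n attempts, c of which were correct) succeeds; a "search compression" result is one where this metric improves at k=1 but not at large k. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts (c correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k incorrect attempts exist, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under the sampler interpretation, RLVR raises pass@1 toward the base model's pre-existing pass@k ceiling rather than raising the ceiling itself.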

Three critical failure modes: (1) Partial verifiers create exploitable gaps (60% error detection leaves 40% exploit space). (2) Spurious rewards: Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching the 29.1% gain from ground truth. (3) Entropy instability: in-distribution accuracy rises while out-of-distribution performance deteriorates.
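Failure mode (1) can be illustrated with a toy simulation (hypothetical, not from the article): a verifier that catches wrong answers only 60% of the time hands full reward to the remaining 40%, and the policy gradient treats that leaked reward identically to genuine success.

```python
import random

def partial_verifier(answer, detect_rate=0.6, rng=random):
    """Hypothetical partial verifier: flags a wrong answer only
    `detect_rate` of the time; undetected errors earn full reward."""
    if answer["correct"]:
        return 1.0
    return 0.0 if rng.random() < detect_rate else 1.0

# Estimate the exploit space: fraction of wrong answers that get rewarded.
rng = random.Random(0)
wrong_answers = [{"correct": False}] * 10_000
leaked = sum(partial_verifier(a, rng=rng) for a in wrong_answers) / len(wrong_answers)
# leaked is roughly 0.4: 60% error detection leaves ~40% exploit space
```

Because the reward signal is binary, the model cannot distinguish "correct" from "incorrect but undetected", so optimization pressure flows directly into the verifier's blind spots.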

Domain applicability: Works for math, code, Text2SQL, structured documents. Fails for creative writing, brand voice, nuanced argumentation, medical/legal decisions.

No sycophancy discussion: The article does not address sycophancy. RLVR's focus is search efficiency and verifiable correctness, not preference alignment behaviors.

Relevance to Hypotheses

Hypothesis | Relationship | Strength
--- | --- | ---
H1 | Supports | RLVR mechanism fundamentally differs from preference-based methods, bypassing sycophancy-inducing reward signals
H2 | Contradicts | RLVR does structurally avoid preference-based bias in domains where it applies
H3 | Supports | Domain limitations ("fails for creative writing, brand voice, nuanced argumentation") align with H3's prediction

Context

The "sampler vs. thinker" debate is important context: even in domains where RLVR works, it may be optimizing selection among existing paths rather than creating new reasoning capability. This has implications for how deeply RLVR can address sycophancy even within its applicable domains.