Research R0041 — Enterprise Sycophancy
Run 2026-04-01
Query Q003
Source SRC01
Evidence SRC01-E01
Type Factual

RLVR methodology and comparison to preference-based methods

URL: https://www.promptfoo.dev/blog/rlvr-explained/

Extract

What RLVR is: RLVR "replaces learned reward models with programmatic verifiers" providing "deterministic feedback (same input always produces the same reward)." The training loop: (1) Generate K candidate solutions per prompt, (2) Verify outputs using programmatic checks, (3) Update policy favoring high-reward trajectories using GRPO, (4) Repeat.
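
The loop above is compact enough to sketch in code. The following minimal Python illustration is not from the source: `policy_sample`, `verify`, and the group-normalized advantage computation are simplified stand-ins for an LLM policy, a real programmatic verifier, and GRPO respectively.

```python
import random

# Minimal sketch of one RLVR step with GRPO-style advantages.
# `policy_sample` and `verify` are hypothetical stand-ins: in practice the
# policy is an LLM and the verifier is a programmatic check (unit tests,
# exact-answer matching, SQL execution results, etc.).

def policy_sample(prompt: str) -> str:
    """Stand-in for drawing one candidate solution from the policy."""
    return f"{prompt} -> {random.randint(0, 9)}"

def verify(candidate: str) -> float:
    """Deterministic rule-based reward: same input, same reward."""
    return 1.0 if candidate.endswith("7") else 0.0  # toy check

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sample relative to its own group of K rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def rlvr_step(prompt: str, k: int = 8):
    candidates = [policy_sample(prompt) for _ in range(k)]  # (1) generate K candidates
    rewards = [verify(c) for c in candidates]               # (2) verify programmatically
    advantages = grpo_advantages(rewards)                   # (3) GRPO advantages
    # (4) In real training, a policy-gradient update would now favor
    # high-advantage trajectories; here we just return the scored samples.
    return list(zip(candidates, rewards, advantages))

for cand, r, adv in rlvr_step("pick a digit"):
    print(f"{cand!r:24} reward={r:.1f} advantage={adv:+.2f}")
```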

Comparison table:

| Method | Reward Signal       | Best For           | Major Limitation |
|--------|---------------------|--------------------|------------------|
| RLHF   | Human preferences   | Subjective quality | Expensive, slow  |
| DPO    | Preference pairs    | Style, tone        | Needs good pairs |
| RLVR   | Programmatic check  | Verifiable tasks   | Needs verifiers  |

Key distinction: "Human preference data remains superior for subjective quality." RLVR "works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation."

Applicable domains: Mathematics, code, Text2SQL (Databricks achieved 73.5-75.68% BIRD test accuracy), logic problems.
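
As a concrete example of a programmatic check in the Text2SQL setting, an execution-match verifier runs both the predicted and gold queries against the same database and rewards matching result sets. The sketch below (stdlib sqlite3; the function name and exact matching rule are illustrative assumptions, not the BIRD or Databricks implementation) shows the idea:

```python
import sqlite3

def execution_match_reward(predicted_sql: str, gold_sql: str, db_path: str) -> float:
    """Hypothetical Text2SQL verifier: 1.0 iff the predicted and gold queries
    return the same result set on the same database, else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        predicted = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # unexecutable SQL earns zero reward
    finally:
        conn.close()
    # Order-insensitive comparison of result rows.
    return 1.0 if sorted(map(repr, predicted)) == sorted(map(repr, gold)) else 0.0
```

Executing the queries, rather than string-matching the SQL, lets semantically equivalent queries earn full reward while keeping the signal deterministic for a fixed database.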

Sycophancy connection: Because verifiable rewards rely on "strict, rule-based evaluations rather than learned approximations, there is little room for the LLM to 'hack' the system," which removes the learned reward model as a sycophancy vector.

Relevance to Hypotheses

| Hypothesis | Relationship         | Rationale |
|------------|----------------------|-----------|
| H1         | Supports (partially) | RLVR eliminates the reward-model sycophancy vector in verifiable domains |
| H2         | Supports             | Clear domain limitations confirm partial applicability |
| H3         | Contradicts          | RLVR does meaningfully address one sycophancy mechanism |