R0041/2026-04-01/Q003/SRC01/E01
RLVR methodology and comparison to preference-based methods
URL: https://www.promptfoo.dev/blog/rlvr-explained/
Extract
What RLVR is: RLVR (reinforcement learning with verifiable rewards) "replaces learned reward models with programmatic verifiers," providing "deterministic feedback (same input always produces the same reward)." The training loop: (1) generate K candidate solutions per prompt, (2) verify the outputs with programmatic checks, (3) update the policy to favor high-reward trajectories using GRPO (Group Relative Policy Optimization), (4) repeat.
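A minimal sketch of this loop in Python. The `policy` object with `generate` and `update` methods, and the helper names `verify`, `group_relative_advantages`, and `rlvr_step`, are hypothetical stand-ins, not the source's implementation; exact-match against a reference answer stands in for the task-specific verifier, and the GRPO-style advantage is each reward normalized against its own group of K samples.

```python
from statistics import mean, pstdev


def verify(candidate: str, reference: str) -> float:
    """Programmatic verifier: deterministic 0/1 reward via exact match.

    The same (candidate, reference) pair always yields the same reward,
    unlike a learned reward model.
    """
    return 1.0 if candidate.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each reward within its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid dividing by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]


def rlvr_step(policy, prompt: str, reference: str, k: int = 8) -> list[float]:
    """One RLVR iteration: sample K candidates, verify, update the policy."""
    candidates = [policy.generate(prompt) for _ in range(k)]  # (1) generate K candidates
    rewards = [verify(c, reference) for c in candidates]      # (2) programmatic check
    advantages = group_relative_advantages(rewards)           # (3) GRPO-style update signal
    policy.update(prompt, candidates, advantages)             # hypothetical policy update
    return rewards                                            # (4) caller repeats over prompts
```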
Comparison table:
| Method | Reward Signal | Best For | Major Limitation |
|---|---|---|---|
| RLHF | Human preferences | Subjective quality | Expensive, slow |
| DPO | Preference pairs | Style, tone | Needs good pairs |
| RLVR | Programmatic check | Verifiable tasks | Needs verifiers |
Key distinction: "Human preference data remains superior for subjective quality." RLVR "works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation."
Applicable domains: mathematics, code, Text2SQL (Databricks reported 73.5-75.68% accuracy on the BIRD test set), and logic problems.
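For Text2SQL, the programmatic check is typically execution accuracy: run the predicted and gold queries and compare result sets. The sketch below, assuming an in-memory or on-disk SQLite database, illustrates that verifier idea; it is not Databricks' actual BIRD evaluation harness.

```python
import sqlite3


def execution_match(predicted_sql: str, gold_sql: str, db_path: str) -> float:
    """Deterministic Text2SQL reward: 1.0 if both queries return the same rows."""
    conn = sqlite3.connect(db_path)
    try:
        # Compare as sets, which ignores row order and duplicates (a simplification).
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return 0.0  # malformed or failing SQL earns zero reward
    finally:
        conn.close()
    return 1.0 if predicted_rows == gold_rows else 0.0
```

Because the check is executed against a fixed database state, the reward is fully deterministic for a given query pair.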
Sycophancy connection: because verifiable rewards rely on "strict, rule-based evaluations rather than learned approximations, there is little room for the LLM to 'hack' the system," which removes the learned reward model as a sycophancy vector.
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports (partially) | RLVR eliminates the reward model sycophancy vector in verifiable domains |
| H2 | Supports | Clear domain limitations confirm partial applicability |
| H3 | Contradicts | RLVR does meaningfully address one sycophancy mechanism |