
Research R0041 — Enterprise Sycophancy
Run 2026-03-28
Query Q003
Hypothesis H3

Statement

RLVR reduces sycophancy in narrow, verifiable domains (mathematics, code, structured queries) but cannot replace preference-based methods in the subjective, open-ended domains where sycophancy is most problematic. A modular training stack is emerging where RLVR and preference methods coexist.

Status

Current: Supported

All evidence converges on this conclusion. RLVR excels where ground truth exists and deterministic verification is possible. It cannot be applied to creative writing, nuanced argumentation, advisory conversations, or any domain requiring subjective quality judgment. The emerging industry practice uses a modular stack: SFT for instruction following, preference optimization (DPO/KTO) for alignment, and RLVR (GRPO/DAPO) for reasoning tasks. This coexistence means preference-based methods, and their sycophancy risks, remain structurally necessary for the foreseeable future.
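To make the distinction concrete, here is a minimal sketch (illustrative only, not drawn from any cited source) of the kind of deterministic "verifiable reward" RLVR relies on: the reward is computed by exact comparison against ground truth, with no learned judge. The `\boxed{}` answer convention is an assumption for the example.

```python
# Illustrative sketch of a verifiable reward function, as used in RLVR.
# Reward is 1.0 iff the completion's final answer exactly matches ground
# truth -- no preference model, no subjective judgment involved.

import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's boxed answer equals the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no verifiable answer present, so no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A subjective prompt ("write a warm apology email") has no ground_truth
# to pass in -- which is exactly why RLVR cannot cover open-ended domains.
```

The hard requirement on a `ground_truth` argument is the structural limit the Status paragraph describes: where no such value exists, this reward cannot be computed at all.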

Supporting Evidence

Evidence Summary
SRC01-E01 "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation."
SRC02-E01 Preference-based methods (DPO/RLHF) incentivize sycophancy through the reward mechanism
SRC03-E01 RLHF amplification mechanism is specific to preference-based training, not present in RLVR
SRC04-E01 DeepSeek-R1 RLVR narrowly focused on reasoning tasks with limited broader applicability
SRC05-E01 Modular stack emerging: SFT + preference optimization + RLVR for different purposes

Contradicting Evidence

No evidence contradicts H3.

Reasoning

H3 is supported by every source examined. The technical literature, implementation reports, and domain analyses all point to the same conclusion: RLVR is a powerful tool for verifiable domains but is fundamentally incompatible with the open-ended, subjective interactions where sycophancy causes the most harm. The practical implication is that sycophancy cannot be "solved" by switching from RLHF to RLVR — it requires better preference-based methods (like the causal reward modeling proposed by Shapira et al.) or hybrid approaches.
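The hybrid implication above can be sketched as a reward dispatcher: verifiable prompts go to a deterministic verifier, everything else falls back to a learned preference reward model. All names here are hypothetical illustrations under the assumption that verifiability is known per-prompt; this is not an API from any source cited above.

```python
# Hypothetical sketch of the hybrid/modular stack: route each completion
# to a deterministic verifier when ground truth exists (RLVR branch),
# otherwise to a learned reward model (preference branch).

from typing import Callable, Optional

def hybrid_reward(
    completion: str,
    ground_truth: Optional[str],
    verifier: Callable[[str, str], float],
    preference_model: Callable[[str], float],
) -> float:
    """Use the deterministic verifier when possible; else the learned RM."""
    if ground_truth is not None:
        return verifier(completion, ground_truth)   # RLVR branch
    return preference_model(completion)             # DPO/RLHF-style branch

# Toy preference model that (sycophantically) over-rewards agreement --
# the failure mode that persists whenever the preference branch is taken.
def toy_rm(text: str) -> float:
    return 0.9 if "you're absolutely right" in text.lower() else 0.5
```

The fallback branch is the point: because open-ended prompts always take it, the sycophancy risk of preference-based rewards remains in the stack regardless of how good the verifier is.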

Relationship to Other Hypotheses

H3 subsumes the valid parts of H1 (RLVR does work in its domains) while acknowledging the scope limitation that H1 overlooks. H3 is the nuanced middle ground the evidence supports.