R0041/2026-03-28/Q003/H3
Statement
RLVR reduces sycophancy in narrow, verifiable domains (mathematics, code, structured queries) but cannot replace preference-based methods in the subjective, open-ended domains where sycophancy is most problematic. A modular training stack is emerging in which RLVR and preference-based methods coexist.
Status
Current: Supported
All evidence converges on this conclusion. RLVR excels where ground truth exists and deterministic verification is possible. It does not apply to creative writing, nuanced argumentation, advisory conversations, or any other domain that requires subjective quality judgment. The emerging industry practice uses a modular stack: SFT for instruction following, preference optimization (DPO/KTO) for alignment, and RLVR (GRPO/DAPO) for reasoning tasks. This coexistence means that preference-based methods, and with them their sycophancy risks, remain structurally necessary for the foreseeable future.
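To make the verification requirement concrete, below is a minimal sketch of an RLVR-style reward function. It assumes a GSM8K-style `####` final-answer marker; the function name and extraction rule are illustrative, not drawn from any source cited here.

```python
from fractions import Fraction

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the final answer matches ground truth.

    Deterministic verification is possible only because a canonical answer
    exists. The '####' final-answer marker follows GSM8K convention; the
    extraction rule here is deliberately simple and illustrative.
    """
    answer = completion.rsplit("####", 1)[-1].strip()
    try:
        # Compare as exact rationals so "7/2" and "3.5" both verify.
        return float(Fraction(answer) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable final answer counts as wrong

print(verifiable_reward("Half of 7 is 3.5. #### 7/2", "3.5"))   # 1.0
print(verifiable_reward("I think it is about 4. #### 4", "3.5"))  # 0.0
```

There is no analogous second argument for "draft a heartfelt apology" or "is my business plan sound?": with no ground truth to pass in, the reward is undefined, which is exactly the scope limitation described above.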
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation." |
| SRC02-E01 | Preference-based methods (DPO/RLHF) incentivize sycophancy through the reward mechanism itself (sketched below the table) |
| SRC03-E01 | RLHF amplification mechanism is specific to preference-based training, not present in RLVR |
| SRC04-E01 | DeepSeek-R1 RLVR narrowly focused on reasoning tasks with limited broader applicability |
| SRC05-E01 | Modular stack emerging: SFT + preference optimization + RLVR for different purposes |
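The mechanism cited in SRC02-E01 can be made concrete with the DPO objective itself (Rafailov et al., 2023). Below is a minimal PyTorch sketch; the tensor values are toy numbers, and the sycophancy reading is an interpretation of the loss, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).

    The implicit reward of a response is its log-prob ratio against the
    reference model. Nothing in the loss scores factual accuracy: whichever
    response annotators labeled 'chosen' has its likelihood pushed up, so
    agreeable-but-wrong preferences propagate directly into the policy.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy demo: the loss falls as the policy favors the 'chosen' response,
# regardless of whether it was chosen for being correct or for being flattering.
pc = torch.tensor([-4.0])  # policy log-prob of the chosen response
pr = torch.tensor([-6.0])  # policy log-prob of the rejected response
rc = torch.tensor([-5.0])  # reference-model log-probs
rr = torch.tensor([-5.0])
print(dpo_loss(pc, pr, rc, rr).item())
```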
Contradicting Evidence
No evidence contradicts H3.
Reasoning
H3 is supported by every source examined. The technical literature, implementation reports, and domain analyses all point to the same conclusion: RLVR is a powerful tool for verifiable domains but is fundamentally incompatible with the open-ended, subjective interactions where sycophancy causes the most harm. The practical implication is that sycophancy cannot be "solved" by switching from RLHF to RLVR; it requires better preference-based methods (such as the causal reward modeling proposed by Shapira et al.) or hybrid approaches that route each prompt to the kind of reward signal its domain supports.
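As an illustration of what such a hybrid approach could look like, here is a hypothetical dispatcher that routes a prompt to a deterministic verifier when ground truth exists and falls back to a learned preference reward model otherwise. All names are illustrative; this is a sketch, not a published method.

```python
from typing import Callable, Optional

RewardFn = Callable[[str, str], float]

def hybrid_reward(prompt: str, completion: str,
                  ground_truth: Optional[str],
                  verifier: RewardFn,
                  preference_model: RewardFn) -> float:
    """Route to RLVR-style verification when possible, preference RM otherwise."""
    if ground_truth is not None:
        return verifier(completion, ground_truth)  # verifiable domain
    return preference_model(prompt, completion)    # subjective domain

# Stub components keep the sketch self-contained.
def exact_match(completion: str, truth: str) -> float:
    return float(completion.strip() == truth)

def stub_rm(prompt: str, completion: str) -> float:
    return 0.5  # stand-in for a learned reward model's score

print(hybrid_reward("What is 2+2?", "4", "4", exact_match, stub_rm))  # 1.0
print(hybrid_reward("Critique my essay", "It shows promise.", None,
                    exact_match, stub_rm))  # 0.5
```

Note that the router does not remove the preference model's sycophancy risk; it only confines that risk to prompts where no verifier exists, which is why the reasoning above calls for better preference-based methods rather than replacement.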
Relationship to Other Hypotheses
H3 subsumes the valid parts of H1 (RLVR does work in its domains) while acknowledging the scope limitation that H1 overlooks. H3 is the nuanced middle ground the evidence supports.