R0041/2026-04-01/Q003/H2¶
Statement¶
RLVR (reinforcement learning with verifiable rewards) can reduce sycophancy in specific verifiable domains (math, code, SQL) but not in the subjective domains where sycophancy is most problematic.
Status¶
Current: Supported
Supporting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR provides deterministic feedback, eliminating the reward-model sycophancy vector in math, code, and SQL |
| SRC01-E02 | Three failure modes limit RLVR even in verifiable domains: partial verifiers, spurious rewards, entropy collapse |
| SRC03-E01 | Attempts to extend RLVR to open-ended tasks via MCQ reformulation expose the approach's fundamental limitation |
| SRC04-E01 | DeepSeek R1 production implementation confirms RLVR works for reasoning but does not claim sycophancy reduction |
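The deterministic-feedback point in SRC01-E01 can be illustrated with a minimal sketch (hypothetical code, not drawn from any cited source): a verifiable reward compares the model's output against a known ground truth, so the training signal depends only on correctness and leaves no learned judge for flattering phrasing to exploit.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Deterministic 0/1 reward: depends only on exact correctness,
    not on tone, confidence, or agreement with the user."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# An agreeable but wrong answer earns nothing...
assert verifiable_reward("You're right, it's 5", "4") == 0.0
# ...while a plain correct answer earns full reward.
assert verifiable_reward("4", "4") == 1.0
```

By contrast, subjective advisory tasks have no ground truth to compare against, which is exactly why this reward shape cannot be applied there.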
Contradicting Evidence¶
| Evidence | Summary |
|---|---|
| None found | No evidence contradicts the partial applicability assessment |
Reasoning¶
All evidence converges on the same conclusion: RLVR is effective within its domain (verifiable tasks) but fundamentally limited outside it. The sycophancy problem is most acute in advisory, interpersonal, and subjective contexts where RLVR cannot apply.
Relationship to Other Hypotheses¶
H2 is the nuanced middle position that the evidence strongly supports. It acknowledges real value while recognizing fundamental limitations.