Research R0041 — Enterprise Sycophancy
Run 2026-04-01
Query Q003
Hypothesis H2

Statement

RLVR can reduce sycophancy in specific verifiable domains (math, code, SQL) but not in the subjective domains where sycophancy is most problematic.

Status

Current: Supported

Supporting Evidence

SRC01-E01: RLVR provides deterministic feedback, eliminating the reward-model sycophancy vector in math, code, and SQL.
SRC01-E02: Three failure modes limit RLVR even in verifiable domains: partial verifiers, spurious rewards, and entropy collapse.
SRC03-E01: Attempts to extend RLVR to open-ended tasks via MCQ reformulation show the fundamental limitation.
SRC04-E01: DeepSeek R1's production implementation confirms RLVR works for reasoning but does not claim sycophancy reduction.

Contradicting Evidence

None found: no evidence contradicts the partial-applicability assessment.

Reasoning

All evidence converges on the same conclusion: RLVR is effective within its domain (verifiable tasks) but fundamentally limited outside it. The sycophancy problem is most acute in advisory, interpersonal, and subjective contexts where RLVR cannot apply.
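The distinction above can be sketched in code. A minimal illustration, assuming a simple exact-match checker as the verifiable reward (the checker name and examples are hypothetical, not from the sources):

```python
def verifiable_reward(response: str, expected: str) -> float:
    """Deterministic reward for a verifiable domain (e.g. a math answer):
    direct comparison against known ground truth, no learned reward model.
    A hypothetical minimal sketch of the RLVR feedback signal."""
    return 1.0 if response.strip() == expected.strip() else 0.0

# Verifiable domain: the checker is deterministic and cannot be flattered.
assert verifiable_reward("42", "42") == 1.0
assert verifiable_reward("You're so right, it must be 41!", "42") == 0.0

# Subjective domain: an advisory prompt like "Should I take this job?"
# has no ground-truth `expected`, so no verifiable_reward can be written;
# feedback must come from a learned reward model, which is exactly the
# channel through which sycophancy enters.
```

The sketch makes the asymmetry concrete: the sycophancy vector is closed only where a ground-truth comparison exists, which is the scope limitation H2 asserts.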

Relationship to Other Hypotheses

H2 is the nuanced middle position that the evidence strongly supports: it acknowledges RLVR's real value in verifiable domains while recognizing its fundamental limitations in the subjective domains where sycophancy matters most.