R0041/2026-04-01/Q003/H2¶
Statement¶
RLVR (reinforcement learning with verifiable rewards) can reduce sycophancy in specific verifiable domains (math, code, SQL) but not in the subjective domains where sycophancy is most problematic.
Status¶
Current: Supported
Supporting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR provides deterministic feedback, eliminating the reward-model sycophancy vector in math, code, and SQL |
| SRC01-E02 | Three failure modes limit RLVR even in verifiable domains: partial verifiers, spurious rewards, entropy collapse |
| SRC03-E01 | Attempts to extend RLVR to open-ended tasks via MCQ reformulation expose the approach's fundamental limitation |
| SRC04-E01 | DeepSeek R1 production implementation confirms RLVR works for reasoning but does not claim sycophancy reduction |
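The deterministic-feedback point in SRC01-E01 can be illustrated with a minimal sketch (hypothetical code, not drawn from any cited source): a verifiable reward compares the model's output against a known ground truth, so the training signal depends only on correctness and leaves no learned judge for flattering phrasing to exploit.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Deterministic 0/1 reward: depends only on exact correctness,
    not on tone, confidence, or agreement with the user."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# An agreeable but wrong answer earns nothing...
assert verifiable_reward("You're right, it's 5", "4") == 0.0
# ...while a plain correct answer earns full reward.
assert verifiable_reward("4", "4") == 1.0
```

By contrast, subjective advisory tasks have no ground truth to compare against, which is exactly why this reward shape cannot be applied there.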
Contradicting Evidence¶
| Evidence | Summary |
|---|---|
| None found | No evidence contradicts the partial applicability assessment |
Reasoning¶
All evidence converges on the same conclusion: RLVR is effective within its domain (verifiable tasks) but fundamentally limited outside it. The sycophancy problem is most acute in advisory, interpersonal, and subjective contexts where RLVR cannot apply.
Relationship to Other Hypotheses¶
H2 is the nuanced middle position that the evidence strongly supports. It acknowledges real value while recognizing fundamental limitations.