# R0041/2026-03-28/Q003/H2
## Statement
RLVR cannot meaningfully address sycophancy: either its domain limitations are too severe, or its mechanism does not actually prevent sycophancy-related behaviors.
## Status
Current: Eliminated
The evidence clearly shows RLVR can prevent sycophancy in verifiable domains. Its mechanism (deterministic binary rewards from ground truth) structurally eliminates the preference-based bias that causes sycophancy. DeepSeek-R1 demonstrated this in mathematics and coding. The question is not whether RLVR can address sycophancy (it can) but how broadly (not broadly enough for most sycophancy-prone interactions).
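The structural claim above can be made concrete with a toy sketch. All function and variable names here are hypothetical illustrations, not any real training API: a verifiable reward depends only on the ground truth, while a learned preference reward (caricatured here) can score agreement with the user, which is the sycophancy channel.

```python
def verifiable_reward(answer: str, ground_truth: str) -> float:
    """RLVR-style reward: deterministic and binary, computed from ground
    truth alone. It cannot reward agreement with the user, because the
    user's stated belief is not an input."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def preference_reward(answer: str, user_stated_belief: str) -> float:
    """Deliberate caricature of a learned preference reward: echoing the
    user's belief raises the score even when the answer is wrong."""
    score = 0.5
    if user_stated_belief in answer:  # agreement bonus -- the sycophancy channel
        score += 0.4
    return score


# The verifiable reward scores agreement with reality, not with the user:
assert verifiable_reward("42", "42") == 1.0
assert verifiable_reward("I agree, it's 41", "42") == 0.0
```

Under this sketch, a sycophantic completion can outscore a correct one only under the preference reward, never under the verifiable one; that is the structural advantage the status paragraph refers to.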
## Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | Spurious rewards: Qwen2.5-Math-7B improved 21.4% with random rewards, nearly matching 29.1% from ground truth — raising questions about RLVR's mechanism |
## Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR fundamentally removes preference-based reward signals that cause sycophancy |
| SRC03-E01 | RLHF amplifies sycophancy through a specific mechanism RLVR does not share |
| SRC04-E01 | DeepSeek-R1 demonstrates functional RLVR in math/code domains |
## Reasoning
H2 is eliminated. While RLVR has significant limitations (spurious reward concerns, domain constraints), it does structurally avoid the preference-based mechanism that causes sycophancy. The spurious reward finding is concerning but does not negate the structural advantage — it suggests implementation details matter, not that the approach is fundamentally flawed.
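The spurious-reward control condition cited in SRC01-E01 can be sketched as a reward that ignores the answer entirely (names hypothetical). The striking part of that finding is that optimizing against such an uninformative signal still improved Qwen2.5-Math-7B by 21.4%, which is why it questions implementation details rather than the structural argument.

```python
import random


def random_reward(_answer: str, rng: random.Random) -> float:
    """Spurious-reward control (sketch): a coin flip that carries zero
    information about answer correctness."""
    return float(rng.random() < 0.5)


# Seeded so the simulation is reproducible.
rng = random.Random(0)
rewards = [random_reward("any rollout", rng) for _ in range(10_000)]
mean_reward = sum(rewards) / len(rewards)  # ~0.5, uncorrelated with quality
```

Because the signal is pure noise, any gains from training against it must come from somewhere other than the reward content (e.g., eliciting behaviors the base model already had), which is the open mechanistic question the reasoning above flags.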
## Relationship to Other Hypotheses
H2 is the null hypothesis. Its elimination confirms RLVR has real anti-sycophancy properties, directing analysis toward the scope question (H1 vs. H3).