# R0041/2026-03-28/Q003/H2
## Statement
RLVR cannot meaningfully address sycophancy: either its domain limitations are too severe, or its mechanism does not actually prevent sycophancy-related behaviors.
## Status
Current: Eliminated
The evidence clearly shows RLVR can prevent sycophancy in verifiable domains. Its mechanism (deterministic binary rewards from ground truth) structurally eliminates the preference-based bias that causes sycophancy. DeepSeek-R1 demonstrated this in mathematics and coding. The question is not whether RLVR can address sycophancy (it can) but how broadly (not broadly enough for most sycophancy-prone interactions).
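The structural claim above can be made concrete with a toy sketch. All function and variable names here are hypothetical illustrations, not any real training API: a verifiable reward depends only on the ground truth, while a learned preference reward (caricatured here) can score agreement with the user, which is the sycophancy channel.

```python
def verifiable_reward(answer: str, ground_truth: str) -> float:
    """RLVR-style reward: deterministic and binary, computed from ground
    truth alone. It cannot reward agreement with the user, because the
    user's stated belief is not an input."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def preference_reward(answer: str, user_stated_belief: str) -> float:
    """Deliberate caricature of a learned preference reward: echoing the
    user's belief raises the score even when the answer is wrong."""
    score = 0.5
    if user_stated_belief in answer:  # agreement bonus -- the sycophancy channel
        score += 0.4
    return score


# The verifiable reward scores agreement with reality, not with the user:
assert verifiable_reward("42", "42") == 1.0
assert verifiable_reward("I agree, it's 41", "42") == 0.0
```

Under this sketch, a sycophantic completion can outscore a correct one only under the preference reward, never under the verifiable one; that is the structural advantage the status paragraph refers to.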
## Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | Spurious rewards: Qwen2.5-Math-7B improved 21.4% with random rewards, nearly matching 29.1% from ground truth — raising questions about RLVR's mechanism |
## Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR fundamentally removes preference-based reward signals that cause sycophancy |
| SRC03-E01 | RLHF amplifies sycophancy through a specific mechanism RLVR does not share |
| SRC04-E01 | DeepSeek-R1 demonstrates functional RLVR in math/code domains |
## Reasoning
H2 is eliminated. While RLVR has significant limitations (spurious reward concerns, domain constraints), it does structurally avoid the preference-based mechanism that causes sycophancy. The spurious reward finding is concerning but does not negate the structural advantage — it suggests implementation details matter, not that the approach is fundamentally flawed.
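The spurious-reward control condition cited in SRC01-E01 can be sketched as a reward that ignores the answer entirely (names hypothetical). The striking part of that finding is that optimizing against such an uninformative signal still improved Qwen2.5-Math-7B by 21.4%, which is why it questions implementation details rather than the structural argument.

```python
import random


def random_reward(_answer: str, rng: random.Random) -> float:
    """Spurious-reward control (sketch): a coin flip that carries zero
    information about answer correctness."""
    return float(rng.random() < 0.5)


# Seeded so the simulation is reproducible.
rng = random.Random(0)
rewards = [random_reward("any rollout", rng) for _ in range(10_000)]
mean_reward = sum(rewards) / len(rewards)  # ~0.5, uncorrelated with quality
```

Because the signal is pure noise, any gains from training against it must come from somewhere other than the reward content (e.g., eliciting behaviors the base model already had), which is the open mechanistic question the reasoning above flags.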
## Relationship to Other Hypotheses
H2 is the null hypothesis. Its elimination confirms RLVR has real anti-sycophancy properties, directing analysis toward the scope question (H1 vs. H3).