R0041/2026-03-28/Q003/H1¶
Statement¶
RLVR can eliminate sycophancy in domains where it applies because its verifiable rewards bypass the preference-based mechanisms that cause sycophancy, and it is effective across a broad range of domains.
Status¶
Current: Partially supported
RLVR's mechanism does fundamentally bypass the preference-based reward signals that cause sycophancy — this part is strongly supported. However, "a broad range of domains" overstates its reach. RLVR is currently limited to domains with verifiable ground truth (math, code, structured queries). Additionally, research shows RLVR achieves "search compression, not expanded reasoning capability," meaning it optimizes selection among paths the model can already generate rather than creating new reasoning abilities.
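The bypass mechanism can be made concrete with a minimal sketch. The function names and the string-matching check below are illustrative assumptions, not drawn from any specific RLVR implementation; the point is only that a programmatic verifier scores against ground truth, while a preference-style reward can be raised by echoing the user:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Programmatic verifier: binary reward for matching ground truth.
    No human preference enters the signal, so agreeing with the user
    cannot raise the reward. (Real verifiers normalize answers more
    carefully; exact string match is a simplification.)"""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def preference_reward(model_answer: str, user_stated_belief: str) -> float:
    """Caricature of a learned preference reward: answers that echo the
    user's stated belief score higher. This is the failure mode that a
    verifiable reward bypasses."""
    return 1.0 if user_stated_belief in model_answer else 0.2


# The verifier rewards correctness regardless of the user's belief,
# while the preference signal rewards agreement even when it is wrong:
print(verifiable_reward("4", "4"))                # 1.0
print(verifiable_reward("5", "4"))                # 0.0
print(preference_reward("Yes, 2+2=5", "2+2=5"))   # 1.0, despite being wrong
```

Note that `verifiable_reward` is only definable when `ground_truth` exists, which is exactly the domain limitation discussed below: for creative or subjective outputs there is nothing to pass as the second argument.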
Supporting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR replaces learned reward models with programmatic verifiers, eliminating preference-based bias |
| SRC04-E01 | DeepSeek-R1 demonstrated emergent reasoning via RLVR in math and code domains |
Contradicting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR fails for creative writing, brand voice, and nuanced argumentation — the most sycophancy-prone domains |
| SRC05-E01 | Binary reward signals cannot capture subjective quality where sycophancy is most damaging |
Reasoning¶
H1 is partially supported: the mechanism is sound (verifiable rewards do bypass preference bias), but the domain limitation is fatal to the "broad range" claim. Sycophancy is most problematic in subjective, open-ended interactions, precisely the domains where RLVR cannot apply.
Relationship to Other Hypotheses¶
H1 shares ground with H3. The distinction is scope: H1 claims broad applicability, whereas H3 acknowledges narrow applicability. The evidence strongly favors H3's more limited claim.