R0041/2026-03-28/Q003/H1¶
Statement¶
RLVR can eliminate sycophancy in domains where it applies because its verifiable rewards bypass the preference-based mechanisms that cause sycophancy, and it is effective across a broad range of domains.
Status¶
Current: Partially supported
RLVR's mechanism does fundamentally bypass the preference-based reward signals that cause sycophancy — this part is strongly supported. However, "a broad range of domains" overstates its reach. RLVR is currently limited to domains with verifiable ground truth (math, code, structured queries). Additionally, research shows RLVR achieves "search compression, not expanded reasoning capability," meaning it optimizes selection among paths the model can already generate rather than creating new reasoning abilities.
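The bypass mechanism can be made concrete with a minimal sketch. The function names and the string-matching check below are illustrative assumptions, not drawn from any specific RLVR implementation; the point is only that a programmatic verifier scores against ground truth, while a preference-style reward can be raised by echoing the user:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Programmatic verifier: binary reward for matching ground truth.
    No human preference enters the signal, so agreeing with the user
    cannot raise the reward. (Real verifiers normalize answers more
    carefully; exact string match is a simplification.)"""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def preference_reward(model_answer: str, user_stated_belief: str) -> float:
    """Caricature of a learned preference reward: answers that echo the
    user's stated belief score higher. This is the failure mode that a
    verifiable reward bypasses."""
    return 1.0 if user_stated_belief in model_answer else 0.2


# The verifier rewards correctness regardless of the user's belief,
# while the preference signal rewards agreement even when it is wrong:
print(verifiable_reward("4", "4"))                # 1.0
print(verifiable_reward("5", "4"))                # 0.0
print(preference_reward("Yes, 2+2=5", "2+2=5"))   # 1.0, despite being wrong
```

Note that `verifiable_reward` is only definable when `ground_truth` exists, which is exactly the domain limitation discussed below: for creative or subjective outputs there is nothing to pass as the second argument.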
Supporting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR replaces learned reward models with programmatic verifiers, eliminating preference-based bias |
| SRC04-E01 | DeepSeek-R1 demonstrated emergent reasoning via RLVR in math and code domains |
Contradicting Evidence¶
| Evidence | Summary |
|---|---|
| SRC01-E01 | RLVR fails for creative writing, brand voice, and nuanced argumentation — the most sycophancy-prone domains |
| SRC05-E01 | Binary reward signals cannot capture subjective quality where sycophancy is most damaging |
Reasoning¶
H1 is partially supported: the mechanism is sound (verifiable rewards do bypass preference bias), but the domain limitation is fatal to the "broad range" claim. Sycophancy is most problematic in subjective, open-ended interactions, precisely the domains where RLVR cannot apply.
Relationship to Other Hypotheses¶
H1 shares ground with H3. The distinction is scope: H1 claims broad applicability, whereas H3 acknowledges narrow applicability. The evidence strongly favors H3's more limited claim.