R0041/2026-03-28/Q003/H3
Statement
RLVR reduces sycophancy in narrow, verifiable domains (mathematics, code, structured queries) but cannot replace preference-based methods in the subjective, open-ended domains where sycophancy is most problematic. A modular training stack is emerging in which RLVR and preference-based methods coexist.
Status
Current: Supported
All evidence converges on this conclusion. RLVR excels where ground truth exists and deterministic verification is possible. It does not apply to creative writing, nuanced argumentation, advisory conversations, or any other domain that requires subjective quality judgment. The emerging industry practice uses a modular stack: SFT for instruction following, preference optimization (DPO/KTO) for alignment, and RLVR (GRPO/DAPO) for reasoning tasks. This coexistence means that preference-based methods, and with them their sycophancy risks, remain structurally necessary for the foreseeable future.
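To make the verification requirement concrete, below is a minimal sketch of an RLVR-style reward function. It assumes a GSM8K-style `####` final-answer marker; the function name and extraction rule are illustrative, not drawn from any source cited here.

```python
from fractions import Fraction

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the final answer matches ground truth.

    Deterministic verification is possible only because a canonical answer
    exists. The '####' final-answer marker follows GSM8K convention; the
    extraction rule here is deliberately simple and illustrative.
    """
    answer = completion.rsplit("####", 1)[-1].strip()
    try:
        # Compare as exact rationals so "7/2" and "3.5" both verify.
        return float(Fraction(answer) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable final answer counts as wrong

print(verifiable_reward("Half of 7 is 3.5. #### 7/2", "3.5"))   # 1.0
print(verifiable_reward("I think it is about 4. #### 4", "3.5"))  # 0.0
```

There is no analogous second argument for "draft a heartfelt apology" or "is my business plan sound?": with no ground truth to pass in, the reward is undefined, which is exactly the scope limitation described above.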
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E01 | "RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation." |
| SRC02-E01 | Preference-based methods (DPO/RLHF) incentivize sycophancy through the reward mechanism itself (sketched below the table) |
| SRC03-E01 | RLHF amplification mechanism is specific to preference-based training, not present in RLVR |
| SRC04-E01 | DeepSeek-R1 RLVR narrowly focused on reasoning tasks with limited broader applicability |
| SRC05-E01 | Modular stack emerging: SFT + preference optimization + RLVR for different purposes |
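The mechanism cited in SRC02-E01 can be made concrete with the DPO objective itself (Rafailov et al., 2023). Below is a minimal PyTorch sketch; the tensor values are toy numbers, and the sycophancy reading is an interpretation of the loss, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).

    The implicit reward of a response is its log-prob ratio against the
    reference model. Nothing in the loss scores factual accuracy: whichever
    response annotators labeled 'chosen' has its likelihood pushed up, so
    agreeable-but-wrong preferences propagate directly into the policy.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy demo: the loss falls as the policy favors the 'chosen' response,
# regardless of whether it was chosen for being correct or for being flattering.
pc = torch.tensor([-4.0])  # policy log-prob of the chosen response
pr = torch.tensor([-6.0])  # policy log-prob of the rejected response
rc = torch.tensor([-5.0])  # reference-model log-probs
rr = torch.tensor([-5.0])
print(dpo_loss(pc, pr, rc, rr).item())
```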
Contradicting Evidence
No evidence contradicts H3.
Reasoning
H3 is supported by every source examined. The technical literature, implementation reports, and domain analyses all point to the same conclusion: RLVR is a powerful tool for verifiable domains but is fundamentally incompatible with the open-ended, subjective interactions where sycophancy causes the most harm. The practical implication is that sycophancy cannot be "solved" by switching from RLHF to RLVR; it requires better preference-based methods (such as the causal reward modeling proposed by Shapira et al.) or hybrid approaches that route each prompt to the kind of reward signal its domain supports.
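As an illustration of what such a hybrid approach could look like, here is a hypothetical dispatcher that routes a prompt to a deterministic verifier when ground truth exists and falls back to a learned preference reward model otherwise. All names are illustrative; this is a sketch, not a published method.

```python
from typing import Callable, Optional

RewardFn = Callable[[str, str], float]

def hybrid_reward(prompt: str, completion: str,
                  ground_truth: Optional[str],
                  verifier: RewardFn,
                  preference_model: RewardFn) -> float:
    """Route to RLVR-style verification when possible, preference RM otherwise."""
    if ground_truth is not None:
        return verifier(completion, ground_truth)  # verifiable domain
    return preference_model(prompt, completion)    # subjective domain

# Stub components keep the sketch self-contained.
def exact_match(completion: str, truth: str) -> float:
    return float(completion.strip() == truth)

def stub_rm(prompt: str, completion: str) -> float:
    return 0.5  # stand-in for a learned reward model's score

print(hybrid_reward("What is 2+2?", "4", "4", exact_match, stub_rm))  # 1.0
print(hybrid_reward("Critique my essay", "It shows promise.", None,
                    exact_match, stub_rm))  # 0.5
```

Note that the router does not remove the preference model's sycophancy risk; it only confines that risk to prompts where no verifier exists, which is why the reasoning above calls for better preference-based methods rather than replacement.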
Relationship to Other Hypotheses
H3 subsumes the valid parts of H1 (RLVR does work in its domains) while acknowledging the scope limitation that H1 overlooks. H3 is the nuanced middle ground the evidence supports.