
R0041/2026-04-01/Q003 — Query Definition

Query as Received

What is RLVR (Reinforcement Learning with Verifiable Rewards) and how does it differ from preference-based methods (RLHF, DPO, KTO) in its potential to eliminate sycophancy? What domains does it currently apply to and what are its limitations?

Query as Clarified

This is a multi-part query decomposing into:

  1. What is RLVR and how does it work technically?
  2. How does it differ from RLHF, DPO, and KTO?
  3. Does RLVR have the potential to reduce or eliminate sycophancy?
  4. What domains does RLVR currently apply to?
  5. What are its limitations, especially for subjective or open-ended tasks?

Embedded assumptions surfaced: The query assumes RLVR has the "potential to eliminate sycophancy," positing a causal link between verifiable rewards and sycophancy reduction that must be tested. The query also uses "eliminate" rather than "reduce," setting a high bar.

Open-ended query approach: This query has an open answer space spanning technical methodology, domain applicability, and limitations. Hypotheses are nevertheless generated, because the core sycophancy question is enumerable (can RLVR reduce sycophancy: yes, no, or partially).

BLUF

RLVR replaces the learned reward models used in RLHF with programmatic verifiers that provide deterministic, binary feedback. This removes the reward model as a vector for sycophancy in domains where ground truth is verifiable (mathematics, code, SQL). However, RLVR fundamentally cannot apply to subjective, open-ended, or interpersonal tasks, precisely the domains where sycophancy is most dangerous. RLVR tends to make models faster at tasks they already know rather than smarter, and it faces significant limitations, including entropy collapse and verifier exploitation. It is a partial solution applicable to a narrow slice of the sycophancy problem.
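To make the contrast concrete, below is a minimal sketch of what "deterministic binary feedback" means in practice. It assumes a math-style task with a known ground-truth answer; the names (`verifiable_reward`, `reward_model`) are illustrative, not drawn from any specific RLVR implementation.

```python
# Minimal sketch of an RLVR-style reward signal, assuming a task where
# the ground-truth answer is known and checkable by exact string match.
# All names here are illustrative, not from any specific implementation.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Programmatic verifier: deterministic, binary, no learned parameters.

    Unlike an RLHF reward model, there is nothing here for the policy to
    flatter: the reward is 1.0 only if the answer checks out.
    """
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


# Contrast (RLHF-style): the reward is the output of a model trained on
# human preference pairs, so it is continuous, learned, and gameable by
# sycophantic phrasing:
#   rlhf_reward = reward_model(prompt, model_answer)

# Schematic use in training: the binary reward scores sampled completions
# and feeds a policy-gradient update (e.g., PPO or GRPO) in place of a
# reward-model score.
completions = ["42", "41"]
rewards = [verifiable_reward(c, ground_truth="42") for c in completions]
assert rewards == [1.0, 0.0]
```

The same pattern generalizes to the other verifiable domains named above, e.g., running unit tests for code or executing a query against a reference result for SQL; the reward stays a pass/fail check rather than a learned judgment.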

Scope

  • Domain: Machine learning training methodology, AI alignment
  • Timeframe: 2024-2026 (RLVR is relatively new at scale)
  • Testability: Verifiable through published research papers, benchmark results, and technical analyses

Assessment Summary

Probability: N/A (open-ended query)

Confidence: Medium-High

Hypothesis outcome: H2 (partial applicability) is best supported. RLVR eliminates one vector for sycophancy (the learned reward model), but only in verifiable domains.

[Full assessment in assessment.md.]

Status

  • Date created: 2026-04-01
  • Date completed: 2026-04-01
  • Researcher profile: Phillip Moore
  • Prompt version: Unified Research Methodology v1
  • Revisit by: 2026-10-01
  • Revisit trigger: RLVR successfully extended to subjective tasks, or a new training method emerges