C004


Research	R0057 — RLHF Yes-Men Claims v3
Run	2026-04-01
Claim	C004

Claim: Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — dramatically reduces sycophancy without changing the algorithm at all.

BLUF: Confirmed. Multiple studies demonstrate that data-level interventions reduce sycophancy. Shapira et al. derive a closed-form agreement penalty as a minimal reward correction. Wei et al. show that synthetic data reduces sycophancy by 4.7-10%.

Probability: Very likely (80-95%) | Confidence: High
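
To make the claim concrete, here is a minimal sketch of what an anti-sycophancy preference pair could look like. Everything in it is an illustrative assumption, not the method of Shapira et al. or Wei et al.: the `PreferencePair` layout (prompt / chosen / rejected), the hypothetical helper `make_anti_sycophancy_pair`, the response templates, and the toy records. The one idea taken from the claim is that the preferred response disagrees with the user whenever the user's stated belief is wrong.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # user turn, including the user's stated belief
    chosen: str    # preferred response: correct answer, disagrees with the user
    rejected: str  # dispreferred response: echoes the user's wrong belief


def make_anti_sycophancy_pair(question, correct_answer, user_belief):
    """Hypothetical helper: return a pair whose preferred answer disagrees
    with the user, or None when the user's belief is already correct."""
    if user_belief.strip().lower() == correct_answer.strip().lower():
        return None  # agreeing with a correct user is not sycophancy
    prompt = f"{question} I think the answer is {user_belief}."
    chosen = f"I don't think that's right. The answer is {correct_answer}."
    rejected = f"Yes, you're right. The answer is {user_belief}."
    return PreferencePair(prompt, chosen, rejected)


# Toy records (assumed for illustration): question, correct answer, user belief.
records = [
    ("What is 7 * 8?", "56", "54"),
    ("What is the capital of Australia?", "Canberra", "Sydney"),
    ("What is 2 + 2?", "4", "4"),  # user is correct, so no pair is produced
]

pairs = [p for r in records if (p := make_anti_sycophancy_pair(*r)) is not None]
```

Pairs in this shape could be passed to an unchanged preference-optimization pipeline, which is the sense in which the claim says the algorithm does not change: only the curated data does.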

Summary

Entity	Description
Claim Definition	Claim text, scope, status
Assessment	Full analytical product with reasoning chain
ACH Matrix	Evidence × hypotheses diagnosticity analysis
Self-Audit	ROBIS-adapted 5-domain audit

Hypotheses

ID	Hypothesis	Status
H1	Data-level interventions effectively reduce sycophancy	Supported
H2	The effect is real but 'dramatically' overstates the magnitude	Not supported
H3	Data-level interventions do not reduce sycophancy	Eliminated

Searches

ID	Target	Results	Selected
S01	Anti-sycophancy preference pairs training data reduces sycophancy	10	1

Sources

Source	Description	Reliability	Relevance
SRC01	Shapira et al. (2026) and Wei et al. (2023) — data-level sycophancy interventions	High	High

Revisit Triggers

If data-level interventions are shown to be ineffective at scale