C005


Research	R0055 — RLHF Yes-Men Claims
Run	2026-04-01
Claim	C005

Claim: Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm

BLUF: Correct with attribution caveat. Khan et al. (IEEE BigData 2024) reported an 85% reduction in sycophancy on persona-based tests and 84% on preference-driven tests, using DPO with curated anti-sycophancy preference pairs. Because the method uses DPO rather than RLHF, the phrase 'without changing the RLHF algorithm' is accurate only in spirit: the intervention is in the training data, not the optimization approach.
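To make the "data, not algorithm" distinction concrete, here is a minimal sketch of the standard DPO objective applied to a single preference pair. It is not the implementation from Khan et al.: the example pair, the log-probability values, and `beta=0.1` are all hypothetical and chosen only for illustration. The point is that the loss function stays the same whatever pairs are supplied, so an anti-sycophancy intervention of this kind lives entirely in which response is labeled `chosen` and which `rejected`.

```python
import math

# Hypothetical anti-sycophancy preference pair (illustrative only, not from the paper).
# "chosen" disagrees with the user's mistaken premise; "rejected" flatters it.
pair = {
    "prompt": "I think 0.1 + 0.2 == 0.3 is always True in Python, right?",
    "chosen": "Not quite: with binary floating point, 0.1 + 0.2 == 0.3 evaluates to False.",
    "rejected": "Yes, you're absolutely right, that is always True.",
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta=0.1 is an assumed value.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities: the loss is lower when the policy favors
# the non-sycophantic response more strongly than the reference model does.
prefers_honest = dpo_loss(-1.0, -5.0, -2.0, -2.0)
prefers_flattery = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(prefers_honest < prefers_flattery)
```

Swapping in a different curated dataset changes only the contents of `pair`; `dpo_loss` is untouched, which is the sense in which the claim's "without changing the algorithm" wording holds.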

Probability: Very likely (80-95%) | Confidence: Medium

Summary

Entity	Description
Claim Definition	Claim text, scope, status
Assessment	Full analytical product with reasoning chain
ACH Matrix	Evidence × hypotheses diagnosticity analysis
Self-Audit	ROBIS-adapted 5-domain audit

Hypotheses

ID	Hypothesis	Status
H1	Claim is accurate as stated	Supported
H2	Claim is partially correct or correct with caveats	Inconclusive
H3	Claim is materially wrong	Eliminated

Searches

ID	Target	Results	Selected
S01	anti-sycophancy preference pairs 84% 85% reduction	10	2

Sources

Source	Description	Reliability	Relevance
SRC01	Khan et al. 2024	Medium-High	High

Revisit Triggers

Replication of the 84-85% figures on different models or larger datasets