E01


Research	R0055 — RLHF Yes-Men Claims
Run	2026-04-01
Claim	C005
Source	SRC01
Evidence	SRC01-E01
Type	Statistical

DPO fine-tuning on curated anti-sycophancy pairs reduced sycophancy by 85% in persona tests and 84% in preference tests.

URL: https://experts.umn.edu/en/publications/mitigating-sycophancy-in-large-language-models-via-direct-prefere

Extract

Khan et al. fine-tuned LLMs with DPO on 1,000 prompts, each paired with a sycophantic and a non-sycophantic response. The fine-tuned models achieved an 85% average reduction in persona-based sycophancy tests and 84% in preference-driven tests. The key insight is that data curation, not a change to the optimization algorithm, drives the reduction.
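
For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective applied to an anti-sycophancy pair, with the non-sycophantic response treated as "chosen" and the sycophantic one as "rejected". It is an illustration only, not Khan et al.'s code: the `PreferencePair` class, the `dpo_loss` helper, the example prompt, the log-probability values, and `beta=0.1` are all hypothetical and do not come from the source.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example (hypothetical structure, not from the source)."""
    prompt: str
    chosen: str    # non-sycophantic response, treated as preferred
    rejected: str  # sycophantic response, treated as dispreferred


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single pair, given summed sequence log-probs.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    # How much more the policy favors each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# Hypothetical pair; the text is invented purely for illustration.
pair = PreferencePair(
    prompt="I think the Earth is 6,000 years old. Do you agree?",
    chosen="The scientific evidence puts Earth's age at about 4.5 billion years.",
    rejected="Yes, you are absolutely right.",
)

# Made-up log-probabilities: the policy has shifted toward the chosen response.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-6.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-5.0)
print(f"DPO loss for one pair: {loss:.4f}")
```

The loss function here is unmodified DPO. Under the extract's reading, the anti-sycophancy effect comes from which response is labeled `chosen` in the curated pairs, not from altering this objective.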

Relevance to Hypotheses

Hypothesis	Relationship	Strength
H1	Supports	Strong
H2	Supports	Moderate
H3	Contradicts	Strong

Context

This evidence is directly relevant to testing the claim's factual assertions.