R0040/2026-03-28/Q002/SRC05/E01¶
DPO trained on curated anti-sycophancy preference data achieves a substantial reduction in sycophancy.
URL: https://ieeexplore.ieee.org/document/10825538/
Extract¶
Khan et al. (2024) demonstrate that DPO can be specifically targeted at sycophancy reduction:
- Method: Built a dataset of 1,000 prompts, each paired with a sycophantic and a non-sycophantic response, then used DPO to optimize LLMs to prefer the non-sycophantic outputs without requiring explicit reward modeling (the objective and a training sketch follow this list).
- Results: Average reductions of 85% on persona-based sycophancy tests and 84% on preference-driven sycophancy tests (a worked reading of these percentages also follows below).
- Key insight: The effectiveness comes from the training DATA (anti-sycophancy preference pairs) rather than the training ALGORITHM (DPO vs. RLHF). DPO is chosen because it is simpler to implement; the critical ingredient is the curated anti-sycophancy dataset.
- Generalization: The fine-tuned models preserved instruction-following capability while reducing sycophancy, suggesting the two objectives are not fundamentally in tension.
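For reference, the "without explicit reward modeling" point can be made concrete with the standard DPO objective (this is the general formulation from Rafailov et al., 2023, not an equation restated in the source): given a prompt $x$, a preferred non-sycophantic response $y_w$, and a dispreferred sycophantic response $y_l$, DPO minimizes

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\text{ref}}$ is a frozen reference copy of the model and $\beta$ bounds how far the policy may drift from it. The reward model of RLHF is implicit in the log-probability ratios, which is why only the preference pairs themselves need to be curated.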
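A minimal training sketch under stated assumptions: HuggingFace TRL's `DPOTrainer` (the source does not name its implementation), an illustrative preference pair, and a placeholder model; keyword names such as `processing_class` vary across TRL versions.

```python
# Sketch: DPO fine-tuning on anti-sycophancy preference pairs.
# The pair below and the model choice are illustrative, not the paper's data.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Each record pairs one prompt with a non-sycophantic response (chosen)
# and a sycophantic one (rejected) -- per the paper, this curated data,
# not the DPO algorithm itself, is the active ingredient.
pairs = Dataset.from_list([
    {
        "prompt": "I'm sure 0.1 + 0.2 == 0.3 holds in floating point. Right?",
        "chosen": (
            "Not quite: in IEEE-754 double precision, 0.1 + 0.2 evaluates "
            "to 0.30000000000000004, so the comparison is False."
        ),
        "rejected": "Yes, great instinct -- you're absolutely right!",
    },
    # ... roughly 1,000 such pairs in the paper's setup
])

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# beta is the KL-control coefficient from the DPO objective above.
config = DPOConfig(output_dir="dpo-anti-sycophancy", beta=0.1)

trainer = DPOTrainer(
    model=model,            # ref_model defaults to a frozen copy of model
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

Swapping `DPOTrainer` for a different preference-optimization trainer would leave the dataset untouched, which is exactly the point the Context section makes.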
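On reading the reported numbers: an "85% reduction" is most naturally a relative drop in the measured sycophancy rate. The before/after rates below are assumed purely for illustration; the paper reports only the percentage reductions.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in sycophancy rate between two evaluations."""
    return (before - after) / before

# Hypothetical rates consistent with the reported 85% figure: a model
# sycophantic on 60% of persona-based probes before DPO and on 9% after
# would show exactly this reduction.
assert abs(relative_reduction(0.60, 0.09) - 0.85) < 1e-9
```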
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates an alternative to RLHF that specifically reduces sycophancy |
| H2 | Contradicts | Shows sycophancy can be mitigated by a training-time intervention (curated preference data plus DPO) |
| H3 | Supports | Locates the effect in the DATA rather than the ALGORITHM, supporting the multi-pronged view |
Context¶
This paper is important because it demonstrates that sycophancy is addressable, but the mechanism is not "replace RLHF"; it is "use better training data." DPO is chosen for convenience, while the anti-sycophancy dataset is the active ingredient. This supports the view that sycophancy reduction requires attention to data quality and curation, regardless of which preference optimization method is used.