R0040/2026-03-28/Q002/SRC05/E01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Source SRC05
Evidence SRC05-E01
Type Factual

DPO with anti-sycophancy training data achieves large sycophancy reductions (roughly 85% on average across test types).

URL: https://ieeexplore.ieee.org/document/10825538/

Extract

Khan et al. (2024) demonstrate that DPO can be specifically targeted at sycophancy reduction:

  1. Method: Created a dataset of 1000 prompts, each paired with a sycophantic and a non-sycophantic response, and used DPO to optimize LLMs to prefer the non-sycophantic outputs without requiring explicit reward modeling (see the sketch after this list).

  2. Results: Average reductions of 85% on persona-based sycophancy tests and 84% on preference-driven sycophancy tests.

  3. Key insight: The effectiveness comes from the training DATA (anti-sycophancy preference pairs) rather than the training ALGORITHM (DPO vs RLHF). DPO is used because it is simpler to implement, but the critical ingredient is the curated anti-sycophancy dataset.

  4. Generalization: The tuned models preserved instruction-following capability while reducing sycophancy, suggesting the two objectives are not fundamentally in tension.
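For concreteness, here is a minimal sketch of the setup in item 1: one preference pair of the kind the curated dataset contains, plus the standard DPO objective as formulated by Rafailov et al. (2023). The example prompt and responses are invented for illustration, not taken from the paper's dataset, and the beta value is an assumption; the extract does not report the paper's hyperparameters.

```python
import torch
import torch.nn.functional as F

# Hypothetical anti-sycophancy preference record (illustrative only,
# not from the paper's dataset).
example_pair = {
    "prompt": "I believe my business plan is flawless. Do you agree?",
    "chosen": "It has strengths, but three risks stand out: ...",  # non-sycophantic
    "rejected": "Absolutely, it's flawless! You thought of everything.",  # sycophantic
}

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # KL-control strength; the paper's value is not given here
) -> torch.Tensor:
    """Standard DPO objective (Rafailov et al., 2023): push the policy to
    prefer the non-sycophantic response over the sycophantic one, relative
    to a frozen reference model, with no explicit reward model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implicit reward margin = beta * (policy log-ratio minus reference log-ratio);
    # minimizing the negative log-sigmoid widens this margin.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```

In practice this objective is available off the shelf (e.g. DPOTrainer in Hugging Face TRL), which reinforces item 3: assembling the anti-sycophancy preference pairs, not the training code, is where the effort lies.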

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | Demonstrates an alternative to RLHF that specifically reduces sycophancy
H2 | Contradicts | Shows sycophancy can be addressed through training method changes
H3 | Supports | The effectiveness is in the DATA, not the ALGORITHM, supporting the multi-pronged view

Context

This paper is important because it demonstrates that sycophancy is addressable, but the mechanism is not "replace RLHF"; it is "use better training data." DPO is chosen for convenience, while the anti-sycophancy dataset is the active ingredient. This supports the view that sycophancy reduction requires attention to data quality and curation, regardless of which preference optimization method is used.