R0040/2026-03-28/Q002/SRC05/E01¶
DPO trained on curated anti-sycophancy preference data achieves a substantial reduction in sycophancy.
URL: https://ieeexplore.ieee.org/document/10825538/
Extract¶
Khan et al. (2024) demonstrate that DPO can be specifically targeted at sycophancy reduction:
- Method: Built a dataset of 1,000 prompts, each paired with a sycophantic and a non-sycophantic response, then used DPO to optimize LLMs to prefer the non-sycophantic outputs without requiring explicit reward modeling (the objective and a training sketch follow this list).
- Results: Average reductions of 85% on persona-based sycophancy tests and 84% on preference-driven sycophancy tests (a worked reading of these percentages also follows below).
- Key insight: The effectiveness comes from the training DATA (anti-sycophancy preference pairs) rather than the training ALGORITHM (DPO vs. RLHF). DPO is chosen because it is simpler to implement; the critical ingredient is the curated anti-sycophancy dataset.
- Generalization: The fine-tuned models preserved instruction-following capability while reducing sycophancy, suggesting the two objectives are not fundamentally in tension.
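For reference, the "without explicit reward modeling" point can be made concrete with the standard DPO objective (this is the general formulation from Rafailov et al., 2023, not an equation restated in the source): given a prompt $x$, a preferred non-sycophantic response $y_w$, and a dispreferred sycophantic response $y_l$, DPO minimizes

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\text{ref}}$ is a frozen reference copy of the model and $\beta$ bounds how far the policy may drift from it. The reward model of RLHF is implicit in the log-probability ratios, which is why only the preference pairs themselves need to be curated.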
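A minimal training sketch under stated assumptions: HuggingFace TRL's `DPOTrainer` (the source does not name its implementation), an illustrative preference pair, and a placeholder model; keyword names such as `processing_class` vary across TRL versions.

```python
# Sketch: DPO fine-tuning on anti-sycophancy preference pairs.
# The pair below and the model choice are illustrative, not the paper's data.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Each record pairs one prompt with a non-sycophantic response (chosen)
# and a sycophantic one (rejected) -- per the paper, this curated data,
# not the DPO algorithm itself, is the active ingredient.
pairs = Dataset.from_list([
    {
        "prompt": "I'm sure 0.1 + 0.2 == 0.3 holds in floating point. Right?",
        "chosen": (
            "Not quite: in IEEE-754 double precision, 0.1 + 0.2 evaluates "
            "to 0.30000000000000004, so the comparison is False."
        ),
        "rejected": "Yes, great instinct -- you're absolutely right!",
    },
    # ... roughly 1,000 such pairs in the paper's setup
])

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# beta is the KL-control coefficient from the DPO objective above.
config = DPOConfig(output_dir="dpo-anti-sycophancy", beta=0.1)

trainer = DPOTrainer(
    model=model,            # ref_model defaults to a frozen copy of model
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

Swapping `DPOTrainer` for a different preference-optimization trainer would leave the dataset untouched, which is exactly the point the Context section makes.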
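On reading the reported numbers: an "85% reduction" is most naturally a relative drop in the measured sycophancy rate. The before/after rates below are assumed purely for illustration; the paper reports only the percentage reductions.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in sycophancy rate between two evaluations."""
    return (before - after) / before

# Hypothetical rates consistent with the reported 85% figure: a model
# sycophantic on 60% of persona-based probes before DPO and on 9% after
# would show exactly this reduction.
assert abs(relative_reduction(0.60, 0.09) - 0.85) < 1e-9
```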
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | Demonstrates an alternative to RLHF that specifically reduces sycophancy |
| H2 | Contradicts | Shows sycophancy can be mitigated by a training-time intervention (curated preference data plus DPO) |
| H3 | Supports | Locates the effect in the DATA rather than the ALGORITHM, supporting the multi-pronged view |
Context¶
This paper is important because it demonstrates that sycophancy is addressable, but the mechanism is not "replace RLHF"; it is "use better training data." DPO is chosen for convenience, while the anti-sycophancy dataset is the active ingredient. This supports the view that sycophancy reduction requires attention to data quality and curation, regardless of which preference optimization method is used.