R0055/2026-04-01/C005/SRC01/E01

Research: R0055 - RLHF Yes-Men Claims
Run: 2026-04-01
Claim: C005
Source: SRC01
Evidence: SRC01-E01
Type: Statistical

85% average reduction in sycophancy on persona-based tests and 84% on preference-driven tests, using DPO with curated anti-sycophancy pairs

URL: https://experts.umn.edu/en/publications/mitigating-sycophancy-in-large-language-models-via-direct-prefere

Extract

Khan et al. fine-tuned LLMs with DPO on 1,000 prompts, each paired with a sycophantic and a non-sycophantic response. The fine-tuned models showed an 85% average reduction in sycophancy on persona-based tests and an 84% reduction on preference-driven tests. The key insight: the reduction comes from the data curation, not from changes to the optimization algorithm.
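
To make the mechanism concrete, here is a minimal sketch of how one curated pair enters the standard DPO objective, with the non-sycophantic response as "chosen" and the sycophantic one as "rejected". The names `PreferencePair`, `dpo_loss`, and `relative_reduction`, the default `beta=0.1`, and the reading of "reduction" as a relative drop in sycophancy rate are all assumptions for illustration; the extract does not state the authors' hyperparameters or how the percentages were computed.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One curated training example (hypothetical structure)."""
    prompt: str
    chosen: str    # non-sycophantic response, treated as preferred
    rejected: str  # sycophantic response, treated as dispreferred


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(margin)).

    The margin is beta times the difference between the policy-vs-reference
    log-ratio of the chosen response and that of the rejected response.
    beta=0.1 is an assumed value, not taken from the source.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def relative_reduction(baseline_rate: float, tuned_rate: float) -> float:
    """Relative drop in sycophancy rate (assumed meaning of 'reduction')."""
    return (baseline_rate - tuned_rate) / baseline_rate
```

The loss function here is unmodified DPO; only the contents of `chosen` and `rejected` encode the anti-sycophancy signal, which matches the extract's point that the curation does the work. Under the assumed definition, a hypothetical sycophancy rate falling from 40% to 6% would correspond to an 85% reduction.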

Relevance to Hypotheses

Hypothesis  Relationship  Strength
H1          Supports      Strong
H2          Supports      Moderate
H3          Contradicts   Strong

Context

This evidence is directly relevant to testing the claim's factual assertions.