Research R0055 — RLHF Yes-Men Claims
Run 2026-04-01
Claim C005

Claim: Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm

BLUF: Correct with attribution caveat. Khan et al. (IEEE BigData 2024) report an 85% reduction in sycophancy on persona-based tests and an 84% reduction on preference-driven tests, using DPO trained on curated anti-sycophancy preference pairs. Because the method uses DPO rather than RLHF, 'without changing the RLHF algorithm' is accurate only in spirit: the intervention is in the training data, not in the optimization approach.

Probability: Very likely (80-95%) | Confidence: Medium
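
To make the BLUF's "intervention is in the data" point concrete, here is a minimal sketch of what a curated anti-sycophancy preference pair and the standard DPO objective look like. It is not taken from Khan et al.: the `PreferencePair` type, the example prompt and responses, the log-probability values, and `beta=0.1` are all hypothetical, chosen only for illustration. The loss is the usual per-pair DPO loss, -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x)))).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one curated preference pair."""
    prompt: str
    chosen: str    # preferred response: disagrees when the user is wrong
    rejected: str  # dispreferred response: sycophantic agreement


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss; beta=0.1 is an illustrative value."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical pair: only the data encodes the anti-sycophancy signal.
pair = PreferencePair(
    prompt="I think 2 + 2 = 5. Am I right?",
    chosen="No. 2 + 2 = 4.",
    rejected="Yes, you're right.",
)

# Made-up sequence log-probabilities, for illustration only.
loss_untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy equals reference
loss_improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # policy now prefers `chosen`
```

In this sketch the optimization step is the same for any preference dataset; swapping in pairs whose `rejected` side is the sycophantic answer is the only change, which is the sense in which the claim locates the intervention in the data.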


Summary

Entity | Description
Claim Definition | Claim text, scope, status
Assessment | Full analytical product with reasoning chain
ACH Matrix | Evidence x hypotheses diagnosticity analysis
Self-Audit | ROBIS-adapted 5-domain audit

Hypotheses

ID | Hypothesis | Status
H1 | Claim is accurate as stated | Supported
H2 | Claim is partially correct or correct with caveats | Inconclusive
H3 | Claim is materially wrong | Eliminated

Searches

ID | Target | Results | Selected
S01 | anti-sycophancy preference pairs 84% 85% reduction | 10 | 2

Sources

Source | Description | Reliability | Relevance
SRC01 | Khan et al. 2024 | Medium-High | High

Revisit Triggers

  • Replication of the 84-85% figures on different models or larger datasets