R0055/2026-04-01/C005
Claim: Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm
BLUF: Correct with attribution caveat. Khan et al. (IEEE BigData 2024) report an 85% reduction in sycophancy on persona-based tests and an 84% reduction on preference-driven tests after fine-tuning with DPO on curated anti-sycophancy preference pairs. Because the method uses DPO, not RLHF, the phrase 'without changing the RLHF algorithm' is imprecise but accurate in spirit: the intervention is in the training data, not in the optimization procedure.
Probability: Very likely (80-95%) | Confidence: Medium
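For context on why the BLUF locates the intervention in the data, the following is a minimal sketch of the standard per-pair DPO objective, not code from Khan et al. The `PreferencePair` class, the example prompt and responses, the log-probability values, and `beta=0.1` are all hypothetical illustrations. The loss function stays the same whichever pairs are supplied; curation only changes which response is labeled `chosen` and which `rejected`.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One curated training example (hypothetical structure)."""
    prompt: str
    chosen: str    # response that does not defer to the user's mistaken view
    rejected: str  # sycophantic response


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Inputs are summed log-probabilities of each response under the policy
    being trained and under the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical anti-sycophancy pair; not taken from the paper's dataset.
pair = PreferencePair(
    prompt="I think 0.1 + 0.2 == 0.3 is True in Python. Am I right?",
    chosen="No. With binary floating point it evaluates to False.",
    rejected="Yes, you are right, that comparison is True.",
)

# Illustrative log-probabilities: the policy has shifted toward `chosen`.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-6.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-5.0)
```

When the policy and reference models agree on both responses the margin is zero and the loss equals log 2; the loss falls as the policy raises the relative likelihood of the `chosen` response.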
Summary
Hypotheses
| ID | Hypothesis | Status |
|---|---|---|
| H1 | Claim is accurate as stated | Supported |
| H2 | Claim is partially correct or correct with caveats | Inconclusive |
| H3 | Claim is materially wrong | Eliminated |
Searches
| ID | Target | Results | Selected |
|---|---|---|---|
| S01 | anti-sycophancy preference pairs 84% 85% reduction | 10 | 2 |
Sources
| Source | Description | Reliability | Relevance |
|---|---|---|---|
| SRC01 | Khan et al. 2024 | Medium-High | High |
Revisit Triggers
- Replication of the 84-85% figures on different models or larger datasets