R0040/2026-03-28/Q002/SRC05
DPO-based sycophancy mitigation achieving 84-85% reduction.
Source
| Field | Value |
| --- | --- |
| Title | Mitigating Sycophancy in Large Language Models via Direct Preference Optimization |
| Publisher | IEEE International Conference on Big Data 2024 |
| Author(s) | Khan et al. |
| Date | 2024 |
| URL | https://ieeexplore.ieee.org/document/10825538/ |
| Type | Research paper (peer-reviewed) |
Summary
| Dimension | Rating |
| --- | --- |
| Reliability | Medium-High |
| Relevance | High |
| Bias: Missing data | Some concerns |
| Bias: Measurement | Some concerns |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A |
| Bias: Protocol deviation | N/A |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Peer-reviewed at IEEE Big Data. However, the dataset is relatively small (1000 prompts) and results may not generalize across all sycophancy types. |
| Relevance | Directly demonstrates that DPO (an RLHF alternative) can specifically target sycophancy reduction when paired with appropriate training data. |
| Bias flags | Measurement: sycophancy reduction measured on specific test sets and may not generalize. Missing data: limited to persona-based and preference-driven sycophancy; does not cover all types. Selective reporting: 84-85% reduction is an average; variance not reported. |
Evidence
| Evidence ID | Summary |
| --- | --- |
| SRC05-E01 | DPO with anti-sycophancy preference pairs reduces sycophancy 84-85% |
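For context on the mechanism behind SRC05-E01: DPO trains directly on preference pairs, so anti-sycophancy data simply labels the non-sycophantic response as "chosen" and the sycophantic one as "rejected". A minimal sketch of the standard DPO per-pair loss is below; the function name, example log-probabilities, and the beta value are illustrative, not taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are total sequence log-probabilities of the chosen
    (non-sycophantic) and rejected (sycophantic) responses under
    the policy being trained and under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): loss falls as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With equal policy and reference log-probs the margin is zero and
# the loss equals log(2); favoring the chosen response lowers it.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Because the reference model anchors the policy, this objective penalizes sycophantic completions without the separate reward model that RLHF pipelines require, which is why the rationale above calls DPO an RLHF alternative.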