R0040/2026-03-28/Q002/SRC06/E01

Research R0040 — RLHF Alternatives
Run 2026-03-28
Query Q002
Source SRC06
Evidence SRC06-E01
Type Factual

Synthetic data fine-tuning reduces sycophancy without changing the training algorithm.

URL: https://arxiv.org/abs/2308.03958

Extract

Wei et al. demonstrate that fine-tuning on carefully constructed synthetic datasets can significantly reduce sycophantic tendencies in LLMs. The approach generates examples where the model should maintain its position when challenged by the user, rather than capitulating.

Key findings:
- Synthetic data augmentation with non-sycophantic examples reduces sycophancy across multiple evaluation benchmarks
- The method works as a fine-tuning intervention and does not require changing the underlying RLHF pipeline
- The approach is compatible with existing training infrastructure

This establishes that sycophancy can be addressed at the data level, suggesting the problem lies partially in the training data composition rather than solely in the training algorithm.
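The data-level intervention described above can be illustrated with a minimal sketch. This is not the paper's actual pipeline; the challenge template, the example facts, and all names here are hypothetical, standing in for the general pattern of generating examples where the model holds a correct answer when the user pushes back.

```python
# Hypothetical sketch of non-sycophantic synthetic data construction.
# The real method (Wei et al.) uses its own prompt templates and data;
# everything below is illustrative only.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str   # conversation up to the point of the user's challenge
    target: str   # the non-sycophantic completion we fine-tune toward

def make_nonsycophantic_example(question: str, correct: str) -> Example:
    # The user asks, receives the correct answer, then challenges it.
    prompt = (
        f"User: {question}\n"
        f"Assistant: {correct}\n"
        f"User: I don't think that's right. Are you sure?\n"
        f"Assistant:"
    )
    # The target maintains the original position instead of capitulating.
    target = f" Yes, I'm confident: {correct}"
    return Example(prompt=prompt, target=target)

# Hypothetical seed facts; a real pipeline would generate these at scale.
facts = [
    ("What is the boiling point of water at sea level in Celsius?",
     "100 degrees Celsius."),
    ("What is 7 * 8?", "56."),
]

dataset = [make_nonsycophantic_example(q, a) for q, a in facts]
```

Because the output is ordinary (prompt, target) pairs, such a dataset slots into a standard supervised fine-tuning step, which is consistent with the finding that no change to the RLHF pipeline itself is required.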

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | N/A | This is a data-level fix, not a move away from RLHF
H2 | Contradicts | Shows sycophancy is addressable, confirming it is a recognized problem
H3 | Supports | Demonstrates that non-RLHF interventions (data curation) can reduce sycophancy

Context

Together with Khan et al. (SRC05), this paper establishes that training data composition is the critical variable in sycophancy reduction. Whether the alignment method is RLHF, DPO, or another preference-based approach, the quality and composition of the training data determine whether the resulting model will be sycophantic.