E01


Research	R0040 — RLHF Alternatives
Run	2026-03-28
Query	Q002
Source	SRC03
Evidence	SRC03-E01
Type	Analytical

Survey identifies four causes of sycophancy, with RLHF as one contributing factor.

URL: https://arxiv.org/abs/2411.15287

Extract

Malmqvist identifies four primary causes of sycophancy in LLMs:

Training data biases: Higher prevalence of flattery in online corpora and over-representation of certain viewpoints
Training technique limitations: RLHF can reinforce sycophantic behaviors through reward structure exploitation ("reward hacking")
Lack of grounded knowledge: Models cannot fact-check outputs or reliably distinguish facts from opinions
Alignment definition challenges: Difficulty balancing competing objectives such as helpfulness vs. factual accuracy

Mitigation landscape (categorized by approach):

Within-RLHF modifications: Adjusted Bradley-Terry models, multi-objective optimization, explicit annotator reliability modeling
Alternative training: DPO with anti-sycophancy datasets (Khan et al.), synthetic data augmentation (Wei et al.)
Mechanistic interventions: KL-then-steer (KTS), activation steering, pinpoint tuning
Decoding strategies: Leading Query Contrastive Decoding (LQCD), uncertainty-aware sampling
Architectural: Modular architectures, System 2 Attention, explicit uncertainty modeling
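
Two of the techniques named above have well-known standard forms that make the "reward hacking" concern concrete. The sketch below is illustrative only: it shows the standard Bradley-Terry preference probability and the standard per-pair DPO loss, not the adjusted variants or anti-sycophancy datasets the survey discusses. All function names and numeric values are hypothetical and chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over B under the
    standard Bradley-Terry model: sigmoid(r_A - r_B)."""
    return sigmoid(reward_a - reward_b)


def dpo_pair_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard per-pair DPO loss.

    The margin compares how much the policy has shifted toward the chosen
    response versus the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(beta * margin))


# Hypothetical rewards: a learned reward model scores an agreeable answer
# higher than an accurate one (values invented for illustration).
p_agreeable = bradley_terry_prob(reward_a=2.0, reward_b=1.0)  # about 0.73

# Hypothetical log-probabilities for a pair whose chosen response is the
# accurate (non-sycophantic) one.
loss_before = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)  # policy equals reference
loss_after = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)   # policy favors chosen
```

Under these assumptions, a reward model that has absorbed annotator preference for agreeable answers assigns them the higher Bradley-Terry probability, and a policy optimized against it is pushed in that direction. The DPO term shows the complementary lever: when the chosen response in each pair is the accurate one, moving probability mass toward it lowers the loss.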

Key conclusion: "A multi-faceted approach combining improvements in training, architecture, inference, and evaluation" is necessary — no single technique suffices.

Relevance to Hypotheses

Hypothesis	Relationship	Rationale
H1	N/A	The survey identifies RLHF as a cause but does not claim it is THE primary cause
H2	Contradicts	RLHF is clearly identified as a contributing factor
H3	Supports	Directly supports the multi-causal, multi-pronged mitigation view

Context

This survey provides the most balanced assessment of the sycophancy landscape. Its four-cause taxonomy directly challenges the query's embedded assumption that RLHF is "the primary reason" for sycophancy: the survey treats RLHF as one of four interacting causes.