Skip to content
Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q002 — RLHF and Sycophancy
Source SRC05
Evidence SRC05-E01

SRC05-E01 — Sycophancy Is Linearly Separable and Distinct from Truthfulness

Extract

"Correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations." Steering is "most effective in a sparse subset of middle-layer attention heads." The sycophancy direction "differs from previously identified 'truthful' directions, indicating that factual accuracy and resistance to deference involve separate underlying processes." Influential heads "disproportionately attend to expressions of user doubt." Probes trained on TruthfulQA "transfer effectively to other factual QA benchmarks."

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Supports — sycophancy is mechanistically understood, enabling targeted interventions Moderate
H2 Contradicts — active mechanistic research underway Strong
H3 Partially contradicts — surgical interventions may be viable without replacing RLHF Moderate

Context

The finding that sycophancy and truthfulness are separate directions is critically important: it means you cannot simply "add truthfulness" to fix sycophancy; the deference behavior must be addressed independently.

Notes

The finding that influential heads attend to "expressions of user doubt" explains the mechanism by which sycophancy is triggered.