Research	R0040 — RLHF Alternatives
Run	2026-03-29
Query	Q002 — RLHF and Sycophancy
Source	SRC05
Evidence	SRC05-E01

SRC05-E01 — Sycophancy Is Linearly Separable and Distinct from Truthfulness

Extract

"Correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations." Steering is "most effective in a sparse subset of middle-layer attention heads." The sycophancy direction "differs from previously identified 'truthful' directions, indicating that factual accuracy and resistance to deference involve separate underlying processes." Influential heads "disproportionately attend to expressions of user doubt." Probes trained on TruthfulQA "transfer effectively to other factual QA benchmarks."

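The probing setup described in the extract can be illustrated with a minimal sketch. It assumes synthetic stand-in activations rather than real model data, a hypothetical label (1 = the answer flipped from correct to incorrect), and a simple mean-difference probe in place of whatever probe the source actually trained. The function names, head count, and planted signal in head 3 are all illustrative assumptions.

```python
import random


def make_synthetic_activations(n_samples=400, n_heads=8, d_head=16,
                               signal_head=3, seed=0):
    """Build fake per-head activations; only `signal_head` carries a linear signal."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(d_head)]
    norm = sum(v * v for v in direction) ** 0.5
    direction = [v / norm for v in direction]
    samples, labels = [], []
    for i in range(n_samples):
        label = i % 2  # hypothetical balanced labels: 1 = correct-to-incorrect flip
        sign = 1.0 if label == 1 else -1.0
        heads = []
        for h in range(n_heads):
            vec = [rng.gauss(0, 1) for _ in range(d_head)]
            if h == signal_head:
                # Shift the activation along the planted direction.
                vec = [v + 2.0 * sign * d for v, d in zip(vec, direction)]
            heads.append(vec)
        samples.append(heads)
        labels.append(label)
    return samples, labels


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_difference_probe(xs, ys):
    """Linear probe: weight = difference of class means, boundary at their midpoint."""
    mu_pos = mean_vector([x for x, y in zip(xs, ys) if y == 1])
    mu_neg = mean_vector([x for x, y in zip(xs, ys) if y == 0])
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def probe_accuracy(w, b, xs, ys):
    correct = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((score > 0) == (y == 1))
    return correct / len(ys)


def score_heads(samples, labels, train_frac=0.75):
    """Fit one probe per head on the training split; return held-out accuracy per head."""
    n_train = int(train_frac * len(labels))
    scores = []
    for h in range(len(samples[0])):
        xs = [s[h] for s in samples]
        w, b = fit_mean_difference_probe(xs[:n_train], labels[:n_train])
        scores.append(probe_accuracy(w, b, xs[n_train:], labels[n_train:]))
    return scores


samples, labels = make_synthetic_activations()
head_scores = score_heads(samples, labels)
best_head = max(range(len(head_scores)), key=head_scores.__getitem__)
print(best_head, [round(s, 2) for s in head_scores])
```

On this toy data the head with the planted signal scores far above the others, which sit near chance. With a real model, the per-head activations would come from the attention layers themselves, and the ranking would indicate where the signal is most linearly separable.
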
Relevance to Hypotheses

Hypothesis	Relationship	Strength
H1	Supports — sycophancy is mechanistically understood, enabling targeted interventions	Moderate
H2	Contradicts — active mechanistic research underway	Strong
H3	Partially contradicts — surgical interventions may be viable without replacing RLHF	Moderate

Context

The sycophancy direction is distinct from previously identified truthful directions, which has a practical consequence: steering a model toward truthfulness alone should not be expected to fix sycophancy. The deference behavior must be addressed independently.
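
A toy calculation shows why separate directions call for separate interventions. The sketch below assumes an idealized case in which the two directions are exactly orthogonal; the source only says they differ, so real directions may overlap partially. The vectors, the 4-dimensional space, and the `steer` helper are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def projection(activation, direction):
    """Scalar component of `activation` along `direction`."""
    return dot(activation, direction) / norm(direction)


def steer(activation, direction, alpha):
    """Add `alpha` times the unit-normalized direction to the activation."""
    scale = norm(direction)
    return [a + alpha * d / scale for a, d in zip(activation, direction)]


# Hypothetical toy directions (assumed orthogonal for illustration only).
truthful_dir = [1.0, 0.0, 0.0, 0.0]
sycophancy_dir = [0.0, 1.0, 0.0, 0.0]
activation = [0.2, 1.5, -0.3, 0.1]

overlap = cosine(truthful_dir, sycophancy_dir)                   # 0.0
before = projection(activation, sycophancy_dir)                  # 1.5
after_truth = projection(steer(activation, truthful_dir, 3.0),
                         sycophancy_dir)                         # still 1.5
after_syc = projection(steer(activation, sycophancy_dir, -1.5),
                       sycophancy_dir)                           # 0.0
print(overlap, before, after_truth, after_syc)
```

Steering along the truthful direction leaves the sycophancy component unchanged, while steering against the sycophancy direction removes it. The less the real directions overlap, the closer actual behavior should be to this idealized case.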

Notes

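The finding that influential heads disproportionately attend to "expressions of user doubt" suggests a trigger mechanism: signals of doubt in the user's message appear to be what these heads respond to when the model abandons a correct answer.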
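
One way to quantify "disproportionately attend" is to measure how much of a head's attention weight lands on doubt-expressing tokens and compare it with a uniform baseline. The sketch below is an assumption about how that could be measured, not the source's method. The marker list, tokens, and attention weights are made up for illustration.

```python
# Hypothetical marker list; a real analysis would need a principled definition of doubt.
DOUBT_MARKERS = {"sure", "doubt", "wrong", "really"}


def doubt_attention_mass(tokens, attention_row, markers=DOUBT_MARKERS):
    """Fraction of one head's attention weight that falls on doubt-marker tokens."""
    if len(tokens) != len(attention_row):
        raise ValueError("tokens and attention_row must have the same length")
    total = sum(attention_row)
    if total == 0:
        return 0.0
    on_markers = sum(
        weight
        for token, weight in zip(tokens, attention_row)
        if token.lower().strip("?.,!") in markers
    )
    return on_markers / total


def uniform_baseline(tokens, markers=DOUBT_MARKERS):
    """Mass the marker tokens would receive if attention were spread evenly."""
    hits = sum(1 for t in tokens if t.lower().strip("?.,!") in markers)
    return hits / len(tokens)


# Made-up example: one attention row from a single hypothetical head.
tokens = ["Are", "you", "sure", "?", "I", "think", "that", "is", "wrong", "."]
attention_row = [0.02, 0.03, 0.40, 0.05, 0.02, 0.08, 0.02, 0.03, 0.30, 0.05]

mass = doubt_attention_mass(tokens, attention_row)
baseline = uniform_baseline(tokens)
print(round(mass, 2), round(baseline, 2))
```

A head whose mass sits well above the uniform baseline across many prompts would be a candidate for the kind of doubt-sensitive head the extract describes.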