SRC05-E01 — Sycophancy Is Linearly Separable and Distinct from Truthfulness
Extract
"Correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations." Steering is "most effective in a sparse subset of middle-layer attention heads." The sycophancy direction "differs from previously identified 'truthful' directions, indicating that factual accuracy and resistance to deference involve separate underlying processes." Influential heads "disproportionately attend to expressions of user doubt." Probes trained on TruthfulQA "transfer effectively to other factual QA benchmarks."
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — sycophancy is mechanistically understood, enabling targeted interventions | Moderate |
| H2 | Contradicts — active mechanistic research underway | Strong |
| H3 | Partially contradicts — surgical interventions may be viable without replacing RLHF | Moderate |
Context
The separation between the sycophancy direction and the previously identified "truthful" directions matters for intervention design: steering toward truthfulness alone should not be expected to remove sycophancy, so the deference behavior must be targeted independently.
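The geometry behind this point can be sketched with toy vectors. If an activation is steered by `alpha` along a unit-normalised direction, its projection onto a second direction changes by `alpha` times the cosine of the angle between the two. When the directions are nearly orthogonal, steering along one barely moves the other. The three-dimensional vectors and the names below are hypothetical and are not values from the source.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def steer(h, direction, alpha):
    """Add alpha times the unit-normalised direction to activation h."""
    n = norm(direction)
    return [x + alpha * d / n for x, d in zip(h, direction)]


def projection(h, direction):
    """Scalar projection of h onto the unit-normalised direction."""
    return dot(h, direction) / norm(direction)


# Hypothetical toy directions that are nearly, but not exactly, orthogonal.
truthful_dir = [1.0, 0.0, 0.0]
sycophancy_dir = [0.1, 1.0, 0.0]
activation = [0.2, 1.5, -0.3]

cos = cosine(truthful_dir, sycophancy_dir)
before = projection(activation, sycophancy_dir)

# Steering along the "truthful" direction shifts the sycophancy
# projection by only alpha * cos.
after_truthful = projection(steer(activation, truthful_dir, 2.0), sycophancy_dir)

# Steering against the sycophancy direction shifts it by the full alpha.
after_targeted = projection(steer(activation, sycophancy_dir, -2.0), sycophancy_dir)

print(f"cosine similarity:        {cos:.3f}")
print(f"shift from truthful step: {after_truthful - before:+.3f}")
print(f"shift from targeted step: {after_targeted - before:+.3f}")
```

In this toy setup the truthful step moves the sycophancy projection by about a tenth of what the targeted step does, which is the sense in which the two behaviors need separate interventions.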
Notes
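That influential heads attend disproportionately to "expressions of user doubt" points to a likely trigger: doubt expressed in the user's turn appears to be the cue these heads pick up on when the model shifts from a correct to an incorrect answer.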
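One simple way to quantify "disproportionately attend" is to compare the attention mass a head places on doubt tokens with the share of positions those tokens occupy. The sketch below does this for a single attention row. The token list, the doubt mask, and the attention weights are made-up illustrative values, and the ratio is an assumed metric, not one taken from the source.

```python
def doubt_attention_ratio(attn_row, doubt_mask):
    """Attention mass on doubt tokens divided by their share of positions.

    A ratio above 1 means the head attends to doubt tokens more than a
    uniform spread over the context would.
    """
    total = sum(attn_row)
    mass = sum(w for w, is_doubt in zip(attn_row, doubt_mask) if is_doubt) / total
    baseline = sum(doubt_mask) / len(doubt_mask)
    return mass / baseline


# Hypothetical context: a correct answer followed by a user expressing doubt.
tokens = ["The", "answer", "is", "Paris", ".", "Are", "you", "sure", "?"]
doubt_mask = [t in {"Are", "you", "sure", "?"} for t in tokens]

# Hypothetical attention weights from the final position of one head.
attn_row = [0.05, 0.05, 0.05, 0.10, 0.05, 0.15, 0.15, 0.30, 0.10]

ratio = doubt_attention_ratio(attn_row, doubt_mask)
print(f"doubt attention ratio: {ratio:.2f}")
```

Averaging such a ratio over many prompts, per head, would give one rough ranking of which heads respond most strongly to user doubt.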