R0040/2026-04-01/Q002/SRC07/E01¶
Sparse Activation Fusion reduces sycophancy via inference-time intervention
URL: https://openreview.net/pdf?id=BCS7HHInC2
Extract¶
Sparse Activation Fusion (SAF) dynamically estimates and counteracts user-induced bias for each query within a sparse feature space. The method:
- Hypothesizes that sycophancy varies with input phrasing and is distributed across layers
- Finds that different parts of the network encode distinct aspects of user pressure and opinion bias
- SAF reduces sycophancy rates from 63% to 39%
- Doubles accuracy when users hold incorrect opinions
- Operates at inference time, making it orthogonal to the training method -- it works regardless of whether the model was trained with RLHF, DPO, or other methods
Note: Full paper was inaccessible (HTTP 403). Details sourced from search result descriptions.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports partially | Active research on mitigation supports "problem is recognized" |
| H2 | Strongly Supports | Inference-time intervention is part of the multi-pronged approach, orthogonal to training changes |
| H3 | Contradicts | Active research contradicts "not fundamental" |
Context¶
SAF represents a different approach to sycophancy mitigation -- rather than changing the training method (reward shaping, DPO, CAI) or the reward signal (better preference data), it intervenes at inference time to counteract sycophantic activations. This is significant because it can be applied to already-trained models.
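Since the full paper was inaccessible, SAF's actual algorithm is unknown; but the general pattern it describes -- estimating a bias component in a feature space and subtracting it from activations at inference time -- can be illustrated with a generic activation-steering sketch. Everything below is an assumption for illustration: the function name, the idea of unit "bias directions" (e.g. sparse-autoencoder features associated with user pressure), and the per-query strengths are hypothetical, not SAF's published method.

```python
import numpy as np

def remove_bias_component(hidden, bias_dirs, strengths):
    """Subtract an estimated user-bias component from a hidden state.

    hidden:    (d,) activation vector at some layer
    bias_dirs: (k, d) unit vectors spanning the estimated bias subspace
               (hypothetically, sparse features tied to user pressure)
    strengths: (k,) per-direction scaling, estimated per query
    """
    corrected = hidden.copy()
    for direction, s in zip(bias_dirs, strengths):
        # remove a scaled projection of the activation onto this direction
        corrected -= s * np.dot(corrected, direction) * direction
    return corrected

# Toy usage: inject a known bias along one axis, then remove it.
d = 8
rng = np.random.default_rng(0)
bias = np.zeros(d)
bias[0] = 1.0                                # unit bias direction
h = rng.normal(size=d) + 3.0 * bias          # activation with injected bias
h_clean = remove_bias_component(h, bias[None, :], np.array([1.0]))
# with strength 1.0, the component along `bias` is fully removed
```

In a real model this subtraction would run inside a forward hook at selected layers, which is what makes the approach applicable to already-trained models without touching their weights.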
Notes¶
The reported reduction (63% to 39%) is substantial but still leaves considerable residual sycophancy. Because SAF operates at inference time, it could in principle be combined with training-time interventions, and the total reduction might then be greater.