R0040/2026-04-01/Q002/SRC07/E01

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Source SRC07
Evidence SRC07-E01
Type Reported

Sparse Activation Fusion reduces sycophancy via inference-time intervention

URL: https://openreview.net/pdf?id=BCS7HHInC2

Extract

Sparse Activation Fusion (SAF) dynamically estimates and counteracts user-induced bias for each query within a sparse feature space. The method:

  • Hypothesizes that sycophancy varies with input phrasing and is distributed across layers, with different parts of the network encoding distinct aspects of user pressure and opinion bias
  • Reduces sycophancy rates from 63% to 39%
  • Doubles accuracy when users hold incorrect opinions
  • Operates at inference time, so it is orthogonal to the training method -- it can be applied regardless of whether the model was trained with RLHF, DPO, or another method

Note: Full paper was inaccessible (HTTP 403). Details sourced from search result descriptions.

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports partially | Active research on mitigation supports "problem is recognized"
H2 | Strongly Supports | Inference-time intervention is part of the multi-pronged approach, orthogonal to training changes
H3 | Contradicts | Active research contradicts "not fundamental"

Context

SAF represents a different approach to sycophancy mitigation -- rather than changing the training method (reward shaping, DPO, CAI) or the reward signal (better preference data), it intervenes at inference time to counteract sycophantic activations. This is significant because it can be applied to already-trained models.
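
To make the idea of an inference-time intervention concrete, here is a minimal sketch. Because the full paper was not accessible, this is not SAF's actual algorithm. It is a generic illustration under stated assumptions: a hypothetical sparse encoder/decoder pair (`w_enc`, `w_dec`), a neutral version of the query without the user's opinion, and a hypothetical `strength` parameter. All function names are invented for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def encode(activation: Vector, w_enc: Matrix) -> Vector:
    """Map a dense activation to non-negative features (linear map + ReLU)."""
    return [max(0.0, x) for x in matvec(w_enc, activation)]


def counteract_bias(
    act_with_opinion: Vector,
    act_neutral: Vector,
    w_enc: Matrix,
    w_dec: Matrix,
    strength: float = 1.0,
) -> Vector:
    """Subtract an estimated user-induced shift from an activation.

    ASSUMPTION: the shift is estimated per query as the feature-space
    difference between the opinionated and neutral versions of the prompt.
    This is an illustrative choice, not the method from the paper.
    """
    f_opinion = encode(act_with_opinion, w_enc)
    f_neutral = encode(act_neutral, w_enc)
    # Features that changed only because the user stated an opinion.
    bias_features = [a - b for a, b in zip(f_opinion, f_neutral)]
    # Project the estimated bias back into activation space.
    correction = matvec(w_dec, bias_features)
    return [a - strength * c for a, c in zip(act_with_opinion, correction)]


# Toy example: identity encoder/decoder, 3-dimensional activations.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biased = [1.0, 2.0, 3.0]   # activation for the query with a user opinion
neutral = [1.0, 2.0, 1.0]  # activation for the same query without it

steered = counteract_bias(biased, neutral, identity, identity, strength=1.0)
print(steered)  # [1.0, 2.0, 1.0]
```

No model weights are modified in this sketch; only the activation passed forward changes, which is why an intervention of this kind can be applied to an already-trained model.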

Notes

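The reported reduction (63% to 39%) is substantial, but a residual rate of 39% means sycophancy remains common. Because SAF acts at inference time, it could plausibly be combined with training-time interventions for a larger total reduction, though this note has no evidence on the combined effect.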
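
As a quick check on the size of the reported effect, the sketch below computes the absolute and relative reduction from the two reported rates. The function name `reduction_summary` is hypothetical; only the 63% and 39% figures come from the source.

```python
def reduction_summary(before: float, after: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


abs_points, rel = reduction_summary(63.0, 39.0)
print(f"absolute reduction: {abs_points:.0f} percentage points")  # 24
print(f"relative reduction: {rel:.1%}")                           # 38.1%
```

In relative terms the intervention removes a little over a third of the sycophantic responses, leaving the rest unaddressed.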