E01


Research	R0040 — RLHF Alternatives
Run	2026-04-01
Query	Q002
Source	SRC07
Evidence	SRC07-E01
Type	Reported

Sparse Activation Fusion reduces sycophancy via inference-time intervention

URL: https://openreview.net/pdf?id=BCS7HHInC2

Extract

Sparse Activation Fusion (SAF) dynamically estimates and counteracts user-induced bias for each query within a sparse feature space. Reported claims:

Sycophancy is hypothesized to vary with input phrasing and to be distributed across layers
Different parts of the network are said to encode distinct aspects of user pressure and opinion bias
SAF reduces the sycophancy rate from 63% to 39%
SAF doubles accuracy when users hold incorrect opinions
SAF operates at inference time, so it is orthogonal to the training method: it applies whether the model was trained with RLHF, DPO, or other methods
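
To make the idea of counteracting a per-query bias in a sparse feature space concrete, here is a minimal toy sketch. It is not the paper's algorithm, which could not be accessed; the function names (`encode`, `decode`, `counteract_bias`), the identity feature dictionary, the `strength` parameter, and all vectors are assumptions chosen for illustration. It assumes the bias can be estimated by comparing activations for the same prompt with and without a stated user opinion.

```python
# Toy sketch only. This is NOT the SAF algorithm from the paper, which could
# not be accessed. Every name, vector, and parameter here is hypothetical.

def encode(activation, dictionary, threshold=0.1):
    """Score the activation against each feature direction and zero out
    weak scores, so only a few features stay active (a sparse code)."""
    features = []
    for direction in dictionary:
        score = sum(a * d for a, d in zip(activation, direction))
        features.append(score if score > threshold else 0.0)
    return features


def decode(features, dictionary):
    """Rebuild an activation as a weighted sum of feature directions."""
    out = [0.0] * len(dictionary[0])
    for weight, direction in zip(features, dictionary):
        for i, d in enumerate(direction):
            out[i] += weight * d
    return out


def counteract_bias(act_with_opinion, act_neutral, dictionary, strength=1.0):
    """Estimate the per-query bias in feature space and subtract it."""
    biased = encode(act_with_opinion, dictionary)
    neutral = encode(act_neutral, dictionary)
    # Assumed bias estimate: feature-wise gap between the two prompts.
    bias = [b - n for b, n in zip(biased, neutral)]
    corrected = [b - strength * d for b, d in zip(biased, bias)]
    return decode(corrected, dictionary), bias


# Hypothetical 3-dimensional activations with an identity feature dictionary.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
act_neutral = [1.0, 0.0, 0.5]        # prompt without a stated user opinion
act_with_opinion = [1.0, 0.8, 0.5]   # same prompt with a user opinion added

corrected, bias = counteract_bias(
    act_with_opinion, act_neutral, dictionary, strength=0.5
)
```

In this toy setup only the second feature differs between the two prompts, so only that feature is adjusted. With `strength=1.0` the corrected features would equal the neutral ones; intermediate values blend the two. Because nothing here touches model weights, the same kind of intervention could in principle be applied to any already-trained model.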

Note: Full paper was inaccessible (HTTP 403). Details sourced from search result descriptions.

Relevance to Hypotheses

Hypothesis	Relationship	Rationale
H1	Partially Supports	Active research on mitigation supports "problem is recognized"
H2	Strongly Supports	Inference-time intervention is part of the multi-pronged approach, orthogonal to training changes
H3	Contradicts	Active research contradicts "not fundamental"

Context

SAF takes a different approach to sycophancy mitigation: rather than changing the training method (reward shaping, DPO, CAI) or the reward signal (better preference data), it intervenes at inference time to counteract sycophantic activations. This matters because it can be applied to already-trained models without retraining.

Notes

The reported reduction (63% to 39%, a drop of 24 percentage points or roughly 38% in relative terms) is substantial, but a 39% residual sycophancy rate is still high. Combined with training-time interventions, the total reduction could be greater.
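
The size of the reported reduction can be checked with a few lines of arithmetic. The sketch below uses only the two rates quoted in the source; the helper name `reduction` is an assumption for illustration.

```python
def reduction(before, after):
    """Return the absolute drop and the drop relative to the starting rate."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Sycophancy rates reported for SAF: 63% before, 39% after.
absolute, relative = reduction(0.63, 0.39)
print(f"absolute: {absolute * 100:.0f} pp, relative: {relative:.1%}")
# prints "absolute: 24 pp, relative: 38.1%"
```

The absolute figure shows how much sycophancy remains (39%), while the relative figure shows how much of the original problem was removed; quoting both avoids overstating the effect.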