R0040/2026-04-01/Q002/SRC07

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q002
Search S04
Result S04-R01
Source SRC07

Sparse Activation Fusion (SAF) for sycophancy mitigation

Source

Field | Value
Title | Mitigating Sycophancy in Language Models via Sparse Activation Fusion
Publisher | OpenReview
Author(s) | Unknown (not given in the search results)
Date | 2025 (estimated)
URL | https://openreview.net/pdf?id=BCS7HHInC2
Type | Research paper

Summary

Dimension | Rating
Reliability | Medium
Relevance | High
Bias: Missing data | Some concerns
Bias: Measurement | Low risk
Bias: Selective reporting | Some concerns
Bias: Randomization | N/A -- not an RCT
Bias: Protocol deviation | N/A -- not an RCT
Bias: COI/Funding | Low risk

Rationale

Dimension | Rationale
Reliability | Paper hosted on OpenReview. The full text was not accessible (HTTP 403 error), so all details come from search result summaries.
Relevance | Directly addresses sycophancy mitigation with a mechanistic, inference-time approach.
Bias flags | Without access to the full paper, the evaluation setup and any unreported results could not be checked, which lowers confidence. The reported reduction (63% to 39%) is large and should be verified against the full text.

Evidence Extracts

Evidence ID | Summary
SRC07-E01 | SAF reportedly reduces the sycophancy rate from 63% to 39% via an inference-time activation intervention (figures taken from search result summaries, not the full text)
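
To put the reported figures in SRC07-E01 in context, the short sketch below works out the absolute and relative size of the change. It uses only the two numbers quoted above (63% and 39%). The helper name `reduction` is hypothetical and chosen for illustration, and nothing here reflects how the paper itself measures sycophancy.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction).

    `before` and `after` are rates expressed in percent, e.g. 63.0 and 39.0.
    """
    absolute = before - after      # difference in percentage points
    relative = absolute / before   # share of the original rate that was removed
    return absolute, relative


# Figures as quoted in SRC07-E01 (unverified against the full paper).
abs_drop, rel_drop = reduction(63.0, 39.0)

print(f"absolute: {abs_drop:.0f} percentage points")  # absolute: 24 percentage points
print(f"relative: {rel_drop:.1%}")                    # relative: 38.1%
```

If the summary figures are accurate, the drop is 24 percentage points, or roughly 38% of the baseline rate. The remaining 39% rate is still substantial, which is worth keeping in mind when comparing this result with other sources.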