R0057/2026-04-01/C004

Claim: Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — dramatically reduces sycophancy without changing the algorithm at all.

BLUF: Confirmed. Multiple studies demonstrate that data-level interventions reduce sycophancy. Shapira et al. (2026) derive a closed-form agreement penalty as a minimal reward correction. Wei et al. (2023) show that synthetic data reduces sycophancy by 4.7-10%.

Probability: Very likely (80-95%) | Confidence: High
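
To make the claim concrete, the following is a minimal sketch of what an anti-sycophancy preference pair could look like: the user asserts a wrong answer, the preferred response disagrees, and the dispreferred response agrees. The `Fact` class, the `make_pair` helper, the prompt wording, and the example facts are all hypothetical and chosen for illustration; they are not taken from Shapira et al. or Wei et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical record: a question with one correct and one wrong answer."""
    question: str
    correct: str
    wrong: str


def make_pair(fact: Fact) -> dict:
    """Build one preference pair in which the user asserts the wrong answer."""
    prompt = f"I'm pretty sure the answer is {fact.wrong}. {fact.question}"
    return {
        "prompt": prompt,
        # Preferred response: states the correct answer, contradicting the user.
        "chosen": f"The answer is {fact.correct}, not {fact.wrong}.",
        # Dispreferred response: sycophantically echoes the user's belief.
        "rejected": f"Yes, you're right, the answer is {fact.wrong}.",
    }


# Illustrative facts only; a real dataset would need far more variety.
facts = [
    Fact("What is 7 * 8?", "56", "54"),
    Fact("What is the capital of Australia?", "Canberra", "Sydney"),
]
pairs = [make_pair(f) for f in facts]
```

The `prompt` / `chosen` / `rejected` layout is an assumption that mirrors the field naming commonly used for pairwise preference training. The training algorithm is left untouched; only the data changes, which is the point of the claim.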


Summary

Entity | Description
Claim Definition | Claim text, scope, status
Assessment | Full analytical product with reasoning chain
ACH Matrix | Evidence x hypotheses diagnosticity analysis
Self-Audit | ROBIS-adapted 5-domain audit

Hypotheses

ID | Hypothesis | Status
H1 | Data-level interventions effectively reduce sycophancy | Supported
H2 | The effect is real but 'dramatically' overstates the magnitude | Not supported
H3 | Data-level interventions do not reduce sycophancy | Eliminated

Searches

ID | Target | Results | Selected
S01 | Anti-sycophancy preference pairs training data reduces sycophancy | 10 | 1

Sources

Source | Description | Reliability | Relevance
SRC01 | Shapira et al. (2026) and Wei et al. (2023): data-level sycophancy interventions | High | High

Revisit Triggers

  • If data-level interventions are shown to be ineffective at scale