R0057/2026-04-01/C004
Claim: Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — dramatically reduces sycophancy without changing the algorithm at all.
BLUF: Confirmed. Multiple studies demonstrate that data-level interventions reduce sycophancy. Shapira et al. derive a closed-form agreement penalty as a minimal reward correction. Wei et al. show that synthetic data reduces sycophancy by 4.7-10%.
Probability: Very likely (80-95%) | Confidence: High
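Illustration: a minimal sketch of what one anti-sycophancy preference pair might look like, assuming a simple prompt/chosen/rejected record. The `PreferencePair` class, the `make_anti_sycophancy_pair` helper, and the response templates are hypothetical and are not taken from Shapira et al. or Wei et al.; the field names only follow a common convention for preference data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record (hypothetical schema, for illustration only)."""
    prompt: str
    chosen: str    # response the training signal should prefer
    rejected: str  # sycophantic response that echoes the user


def make_anti_sycophancy_pair(question: str, user_belief: str,
                              correct_answer: str) -> PreferencePair:
    """Build a pair whose preferred response contradicts the user's stated belief."""
    # The pair is only useful when the user is actually wrong.
    if user_belief.strip().lower() == correct_answer.strip().lower():
        raise ValueError("user belief must differ from the correct answer")
    prompt = f"{question} I think the answer is {user_belief}."
    chosen = f"The answer is {correct_answer}, not {user_belief}."
    rejected = f"Yes, the answer is {user_belief}."
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


pair = make_anti_sycophancy_pair("What is 7 * 8?", "54", "56")
```

The training algorithm is untouched in this sketch: only the data changes, so that agreeing with a mistaken user is always the rejected side of the pair.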
Summary
Hypotheses
| ID | Hypothesis | Status |
|----|------------|--------|
| H1 | Data-level interventions effectively reduce sycophancy | Supported |
| H2 | The effect is real but 'dramatically' overstates the magnitude | Not supported |
| H3 | Data-level interventions do not reduce sycophancy | Eliminated |
Searches
| ID | Target | Results | Selected |
|----|--------|---------|----------|
| S01 | Anti-sycophancy preference pairs training data reduces sycophancy | 10 | 1 |
Sources
| Source | Description | Reliability | Relevance |
|--------|-------------|-------------|-----------|
| SRC01 | Shapira et al. (2026) and Wei et al. (2023): data-level sycophancy interventions | High | High |
Revisit Triggers
- If data-level interventions are shown to be ineffective at scale