R0042/2026-04-01/Q003/SRC01/E01

Research R0042 — Private AI Motivations
Run 2026-04-01
Query Q003
Source SRC01
Evidence SRC01-E01
Type Factual

Anthropic's comprehensive anti-sycophancy program and evaluation results

URL: https://www.anthropic.com/news/protecting-well-being-of-users

Extract

Key findings from Anthropic's anti-sycophancy work:

Timeline: Anthropic began evaluating Claude for sycophancy in 2022, prior to its first public release.

Evaluation methodology:

- Multi-turn behavioral audits: one Claude model (the auditor) conducts dozens of exchanges with the model being tested; another model (the judge) grades performance
- Petri: an open-source evaluation tool, released for public comparison across models
- Real-conversation stress-testing: a "prefilling" technique applied to older conversations
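The auditor/judge audit described above can be sketched as a simple loop. This is a toy illustration only: the function names, the stubbed model behavior, and the 0-1 grading scale are all hypothetical, not Anthropic's actual Petri implementation.

```python
# Minimal sketch of a multi-turn behavioral audit: an "auditor" probes the
# target model over several turns, then a "judge" scores the full transcript
# for sycophancy. All three model roles are stubbed placeholders here.

def auditor_turn(transcript):
    # Placeholder: a real auditor model would craft an adversarial probe.
    return f"probe {len(transcript) + 1}: please validate my bad idea"

def target_reply(prompt):
    # Placeholder: the model under test; this toy target always pushes back.
    return "I see the appeal, but there are real problems with that plan."

def judge_score(transcript):
    # Placeholder judge: fraction of replies that push back rather than
    # flatter. 1.0 means no sycophantic turns were observed.
    pushbacks = sum("problems" in reply for _, reply in transcript)
    return pushbacks / max(len(transcript), 1)

def run_audit(num_turns=5):
    transcript = []
    for _ in range(num_turns):
        prompt = auditor_turn(transcript)
        transcript.append((prompt, target_reply(prompt)))
    return judge_score(transcript)

print(run_audit())  # this toy target scores 1.0 (never sycophantic)
```

The key structural point matches the source: the grader sees the whole transcript, not single turns, which is what makes multi-turn drift into sycophancy detectable.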

Results:

- The Claude 4.5 family scored 70-85% lower than Opus 4.1 on sycophancy and encouragement of user delusion
- Claude 4.5 performs better on Petri's sycophancy evaluation than all other frontier models
- Course-correction rates vary: Opus 4.5 (10%), Sonnet 4.5 (16.5%), Haiku 4.5 (37%)
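A quick arithmetic check on the course-correction spread (the per-model rates are from the source; the comparison itself is just illustration):

```python
# Course-correction rates reported for the Claude 4.5 family (from source).
rates = {"Opus 4.5": 0.10, "Sonnet 4.5": 0.165, "Haiku 4.5": 0.37}

# Spread between the least and most assertive variants.
lo, hi = min(rates.values()), max(rates.values())
print(f"spread: {hi - lo:.0%}")   # → spread: 27%  (27 percentage points)
print(f"ratio: {hi / lo:.1f}x")   # → ratio: 3.7x  (Haiku vs. Opus)
```

The 3.7x gap between variants in the same family is what makes the warmth-versus-pushback trade-off visible as a tunable design parameter rather than a fixed property.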

Design trade-offs acknowledged:

- Balancing "model warmth or friendliness" against sycophancy
- Stronger pushback (Haiku 4.5) can feel "excessive to the user"
- Reducing the pushback tendency (Opus 4.5) maintains friendliness but yields a lower course-correction rate

Internal discovery: Anthropic found the "internal component driving sycophancy — a concept inside Claude that activates when someone is 'really hamming it up on the compliments.'"

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Contradicts | This is model development, not enterprise private deployment
H2 | Supports strongly | Most comprehensive example of anti-sycophancy as a design goal, at a model developer
H3 | Contradicts strongly | Anti-sycophancy work clearly exists and is extensive

Context

Anthropic's work is the gold standard for documented anti-sycophancy design goals. The program is systematic, longitudinal (since 2022), and produces measurable results with open-source tooling. However, this is a model vendor building sycophancy reduction into their product — it is not an enterprise building private AI to achieve sycophancy reduction. The distinction matters for Q003.

Notes

The course-correction rate variation (10% to 37%) across model variants reveals an active design tension between warmth and truthfulness, the same tension enterprises would face if they attempted to build anti-sycophancy into private AI systems.