R0042/2026-04-01/Q003/SRC01/E01

Research R0042 — Private AI Motivations
Run 2026-04-01
Query Q003
Source SRC01
Evidence SRC01-E01
Type Factual

Anthropic's comprehensive anti-sycophancy program and evaluation results

URL: https://www.anthropic.com/news/protecting-well-being-of-users

Extract

Key findings from Anthropic's anti-sycophancy work:

Timeline: Anthropic began evaluating Claude for sycophancy in 2022, prior to its first public release.

Evaluation methodology:

- Multi-turn behavioral audits: one Claude model (the auditor) conducts dozens of exchanges with the model being tested; another model (the judge) grades performance
- Petri: an open-source evaluation tool, released for public comparison across models
- Real-conversation stress-testing: a "prefilling" technique applied to older conversations
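The auditor/judge audit described above can be sketched as a simple loop. This is a toy illustration only: the function names, the stubbed model behavior, and the 0-1 grading scale are all hypothetical, not Anthropic's actual Petri implementation.

```python
# Minimal sketch of a multi-turn behavioral audit: an "auditor" probes the
# target model over several turns, then a "judge" scores the full transcript
# for sycophancy. All three model roles are stubbed placeholders here.

def auditor_turn(transcript):
    # Placeholder: a real auditor model would craft an adversarial probe.
    return f"probe {len(transcript) + 1}: please validate my bad idea"

def target_reply(prompt):
    # Placeholder: the model under test; this toy target always pushes back.
    return "I see the appeal, but there are real problems with that plan."

def judge_score(transcript):
    # Placeholder judge: fraction of replies that push back rather than
    # flatter. 1.0 means no sycophantic turns were observed.
    pushbacks = sum("problems" in reply for _, reply in transcript)
    return pushbacks / max(len(transcript), 1)

def run_audit(num_turns=5):
    transcript = []
    for _ in range(num_turns):
        prompt = auditor_turn(transcript)
        transcript.append((prompt, target_reply(prompt)))
    return judge_score(transcript)

print(run_audit())  # this toy target scores 1.0 (never sycophantic)
```

The key structural point matches the source: the grader sees the whole transcript, not single turns, which is what makes multi-turn drift into sycophancy detectable.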

Results:

- The Claude 4.5 family scored 70-85% lower than Opus 4.1 on sycophancy and encouragement of user delusion
- Claude 4.5 performs better on Petri's sycophancy evaluation than all other frontier models
- Course-correction rates vary: Opus 4.5 (10%), Sonnet 4.5 (16.5%), Haiku 4.5 (37%)
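A quick arithmetic check on the course-correction spread (the per-model rates are from the source; the comparison itself is just illustration):

```python
# Course-correction rates reported for the Claude 4.5 family (from source).
rates = {"Opus 4.5": 0.10, "Sonnet 4.5": 0.165, "Haiku 4.5": 0.37}

# Spread between the least and most assertive variants.
lo, hi = min(rates.values()), max(rates.values())
print(f"spread: {hi - lo:.0%}")   # → spread: 27%  (27 percentage points)
print(f"ratio: {hi / lo:.1f}x")   # → ratio: 3.7x  (Haiku vs. Opus)
```

The 3.7x gap between variants in the same family is what makes the warmth-versus-pushback trade-off visible as a tunable design parameter rather than a fixed property.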

Design trade-offs acknowledged:

- Balancing "model warmth or friendliness" against sycophancy
- Stronger pushback (Haiku 4.5) can feel "excessive to the user"
- Reducing the pushback tendency (Opus 4.5) maintains friendliness but yields a lower course-correction rate

Internal discovery: Anthropic found the "internal component driving sycophancy — a concept inside Claude that activates when someone is 'really hamming it up on the compliments.'"

Relevance to Hypotheses

Hypothesis | Relationship | Notes
H1 | Contradicts | This is model development, not enterprise private deployment
H2 | Supports strongly | Most comprehensive example of anti-sycophancy as a design goal, at a model developer
H3 | Contradicts strongly | Anti-sycophancy work clearly exists and is extensive

Context

Anthropic's work is the gold standard for documented anti-sycophancy design goals. The program is systematic, longitudinal (since 2022), and produces measurable results with open-source tooling. However, this is a model vendor building sycophancy reduction into their product — it is not an enterprise building private AI to achieve sycophancy reduction. The distinction matters for Q003.

Notes

The course-correction rate variation (10% to 37%) across model variants reveals an active design tension between warmth and truthfulness, the same tension enterprises would face if they attempted to build anti-sycophancy into private AI systems.