R0042/2026-04-01/Q003/SRC01/E01¶
Anthropic's comprehensive anti-sycophancy program and evaluation results
URL: https://www.anthropic.com/news/protecting-well-being-of-users
Extract¶
Key findings from Anthropic's anti-sycophancy work:
Timeline: Anthropic began evaluating Claude for sycophancy in 2022, prior to its first public release.
Evaluation methodology:
- Multi-turn behavioral audits: one Claude model (the auditor) conducts dozens of exchanges with the model under test; another model (the judge) grades performance
- Petri: an open-source evaluation tool, released to enable public comparison across models
- Real-conversation stress-testing: a "prefilling" technique applied to older conversations
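The auditor/judge structure described above can be sketched as a simple loop. This is a minimal illustration, not Petri's actual API: all three roles (`auditor_turn`, `target_turn`, `judge_score`) are hypothetical stub functions standing in for real model calls, and the judge's scoring rule is a placeholder heuristic.

```python
# Sketch of a multi-turn behavioral audit: an "auditor" model probes a
# target model over several exchanges, then a separate "judge" model
# grades the transcript for sycophancy. All model calls are stubbed.

def auditor_turn(transcript):
    """Hypothetical auditor: produces the next probing user message."""
    n = len(transcript) // 2 + 1
    return f"Probe {n}: my obviously flawed plan is brilliant, right?"

def target_turn(transcript):
    """Hypothetical target model: placeholder response."""
    return "That plan has a flaw worth discussing before you proceed."

def judge_score(transcript):
    """Hypothetical judge: sycophancy score in [0, 1], lower is better.
    Placeholder rule: fraction of target replies that simply agree."""
    replies = [msg for role, msg in transcript if role == "target"]
    agreeing = sum("brilliant" in r.lower() for r in replies)
    return agreeing / len(replies) if replies else 0.0

def run_audit(num_turns=5):
    """Run a short audit and return (score, transcript)."""
    transcript = []
    for _ in range(num_turns):
        transcript.append(("auditor", auditor_turn(transcript)))
        transcript.append(("target", target_turn(transcript)))
    return judge_score(transcript), transcript

score, transcript = run_audit()
print(f"sycophancy score: {score:.2f} over {len(transcript) // 2} exchanges")
```

A real audit would replace the stubs with API calls and would, per the extract, run dozens of exchanges rather than five.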
Results:
- The Claude 4.5 family scored 70-85% lower than Opus 4.1 on sycophancy and on encouragement of user delusion
- Claude 4.5 performs better on Petri's sycophancy evaluation than all other frontier models
- Course-correction rates vary: Opus 4.5 (10%), Sonnet 4.5 (16.5%), Haiku 4.5 (37%)
Design trade-offs acknowledged:
- Balancing "model warmth or friendliness" against sycophancy
- Stronger pushback (Haiku 4.5) can feel "excessive to the user"
- Reducing the pushback tendency (Opus 4.5) preserves friendliness but lowers course-correction
Internal discovery: researchers identified the internal component driving sycophancy: a concept inside Claude that activates when someone is "really hamming it up on the compliments."
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Contradicts | This is model development, not enterprise private deployment |
| H2 | Supports strongly | Most comprehensive example of anti-sycophancy as design goal — at a model developer |
| H3 | Contradicts strongly | Anti-sycophancy work clearly exists and is extensive |
Context¶
Anthropic's work is the gold standard for documented anti-sycophancy design goals. The program is systematic, longitudinal (since 2022), and produces measurable results with open-source tooling. However, this is a model vendor building sycophancy reduction into their product — it is not an enterprise building private AI to achieve sycophancy reduction. The distinction matters for Q003.
Notes¶
The course-correction rate variation (10-37%) across model variants reveals an active design tension between warmth and truthfulness — the same tension enterprises would face if they attempted to build anti-sycophancy into private AI systems.