S05


Research	R0041 — Enterprise Sycophancy
Run	2026-04-01
Query	Q001
Search	S05

WebSearch — Sycophancy benchmarks and evaluation tools

Summary

Field	Value
Source/Database	WebSearch (two queries combined)
Query terms	(1) Anthropic sycophancy evaluation benchmark model card 2024 2025; (2) "syco-bench" benchmark AI sycophancy measurement evaluation tool
Filters	None
Results returned	20
Results selected	4
Results rejected	16

Selected Results

Result	Title	URL	Rationale
S05-R01	Bloom: an open source tool for automated behavioral evaluations	https://alignment.anthropic.com/2025/bloom-auto-evals/	Primary source on Anthropic's sycophancy evaluation tool
S05-R02	syco-bench: A benchmark for LLM Sycophancy	https://www.syco-bench.com/	Independent sycophancy benchmark
S05-R03	SYCON-Bench (Findings of EMNLP 2025)	https://github.com/JiseungHong/SYCON-Bench	Peer-reviewed multi-turn sycophancy benchmark
S05-R04	ELEPHANT: Measuring and understanding social sycophancy in LLMs	https://arxiv.org/abs/2505.13995	Research paper on social sycophancy measurement

Rejected Results

Result	Title	URL	Rationale
S05-R05	Findings from Anthropic-OpenAI Alignment Evaluation Exercise	https://alignment.anthropic.com/2025/openai-findings/	Cross-lab evaluation, not sycophancy-specific
S05-R06	Measuring Sycophancy in Multi-turn Dialogues	https://aclanthology.org/2025.findings-emnlp.121.pdf	Same as SYCON-Bench (duplicate)
S05-R07	OpenAI-Anthropic Safety Evaluation (OpenAI side)	https://openai.com/index/openai-anthropic-safety-evaluation/	General safety evaluation exercise
S05-R08	Anthropic Summer 2025 Sabotage Risk Report	https://alignment.anthropic.com/2025/sabotage-risk-report/2025_pilot_risk_report.pdf	Sabotage risk, not sycophancy
S05-R09	Testing AI Models with Bloom	https://www.softwaretestingmagazine.com/news/testing-ai-models-with-anthropic-bloom-open-source-tool/	Secondary reporting of Bloom
S05-R10	Anthropic Transparency Hub Model Report	https://www.anthropic.com/transparency/model-report	General transparency, not sycophancy-specific
S05-R11	Alignment Science Blog	https://alignment.anthropic.com/	Blog index page
S05-R12	syco-bench PDF paper	https://www.syco-bench.com/syco-bench.pdf	Same source as R02, different format
S05-R13	GitHub - syco-bench	https://github.com/timfduffy/syco-bench	Repository for R02 benchmark
S05-R14	Measuring Sycophancy in Multi-turn (arxiv)	https://arxiv.org/abs/2505.23840	Overlaps with SYCON-Bench
S05-R15	syco-bench Substack post	https://timfduffy.substack.com/p/syco-bench-a-simple-benchmark-of	Blog post about R02, secondary
S05-R16	SycEval: Evaluating LLM Sycophancy	https://arxiv.org/pdf/2502.08177	Additional benchmark, not selected due to scope
S05-R17	SycBench	https://www.sycbench.org/	Another benchmark variant
S05-R18	SycEval (Semantic Scholar)	https://www.semanticscholar.org/paper/SycEval:-Evaluating-LLM-Sycophancy-Fanous-Goldberg/796f0ce165479e22f95c9f8d02b1b239816f46ef	Duplicate of R16
S05-R19	Sycophancy Is Not One Thing (arxiv)	https://arxiv.org/html/2509.21305v1	Interesting but not directly addressing enterprise products
S05-R20	Sycophantic AI decreases prosocial intentions (Science)	https://www.science.org/doi/10.1126/science.aec8352	Used as source in Q002 instead

Notes

The benchmark landscape for sycophancy measurement is developing rapidly. This search alone surfaced six distinct benchmarks or tools (Bloom, syco-bench, SYCON-Bench, ELEPHANT, SycEval, and SycBench), which suggests the field is moving past the "we know it's a problem" phase into systematic measurement. However, none of these is positioned as an enterprise evaluation tool; all are research instruments.