R0056 — RLHF Yes-Men Claims v2

Mode: Claim · Status: Active · Tags: sycophancy, RLHF, enterprise-AI, training-gap

Input

AI models affirm users' views approximately 49% more often than humans do.
A 2026 mathematical framework demonstrated the complete causal chain showing that human labelers systematically prefer agreeable responses, creating a "reward tilt" in preference data that RLHF then amplifies through optimization.
The sycophancy amplification originates from systematic bias in preference data, not algorithmic failures in RLHF itself.
Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — reduces sycophancy by 84-85%, without changing the algorithm.
Synthetic non-sycophantic training data reduces sycophancy by 4.7-10%.
At least six major alternatives to RLHF have emerged since 2022 (DPO, KTO, Constitutional AI, GRPO, ORPO, RLVR).
RLVR (Reinforcement Learning with Verifiable Rewards) replaces human preference signals with deterministic correctness verification.
DeepSeek V3, trained with RLVR, was found to be the most sycophantic model in an independent evaluation.
Sycophancy is the mildest manifestation of a broader class of reward hacking, according to Anthropic research.
The same optimization pressure that produces sycophancy can, at higher intensity, produce an AI that sabotages oversight mechanisms or actively deceives its operators.
Eighty-two percent of enterprises now have AI training programs.
Fifty-nine percent of workers report persistent AI skills gaps and 56% have received no recent AI training.
A search of 29 sources across corporate training providers, consulting firms (Deloitte, KPMG), government agencies (GSA, DoD, NHS, UK Government Digital Service), regulatory frameworks (EU AI Act, NIST AI RMF), law firm policy templates, and UX research organizations found zero warnings about sycophancy under any terminology.
Users self-report applying zero critical thinking to 40% of AI-assisted tasks.
Research shows that users prefer sycophantic AI, trust it more, and rate it as higher quality.
The GPT-4o sycophancy rollback incident affected millions of users and made headlines.
Georgetown Law and Stanford have published policy analyses recommending that training address sycophancy.
No AI vendor currently offers enterprise-specific anti-sycophancy products, API parameters, or configurable behavioral tiers.
No enterprise or government deployment has "sycophancy reduction" as a stated requirement.
Enterprises building private AI systems are motivated by data sovereignty and security, not behavioral customization; sycophancy does not appear on their list of reasons.
AI safety researchers use the term "sycophancy" while regulated industries (aviation, defense, healthcare, finance) use "automation bias," "automation complacency," "overtrust," "overreliance," or "acquiescence" for closely related phenomena.
A network analysis of AI research communities found 83% homophily — these groups overwhelmingly cite within their own community with only 1% of authors bridging the divide.
The EU AI Act chose the term "automation bias" and produced a deployer-awareness obligation (train people not to overtrust) rather than a system-design constraint (make the AI stop agreeing when wrong).
Every major bridging taxonomy examined — the MIT AI Risk Repository, the AIR 2024 categorization, and the Standardized Threat Taxonomy — omits sycophancy as a distinct category.
The DoD's CaTE center (Center for Calibrated Trust Measurement and Evaluation) has published detailed frameworks for measuring trust in AI systems but does not address system output behavior or the concept of AI adjusting output to match user expectations.
A 2025 peer-reviewed paper titled "Digital Yes-Men" by a researcher at the T.M.C. Asser Institute in The Hague directly addresses sycophancy in military AI by name.
Engagement optimization and sycophancy reduction are directly opposed, as documented by Georgetown Law, Brookings, Stanford/CMU, and multiple independent researchers.
Prompt-level sycophancy fixes risk producing covert sycophancy — an AI that has learned not to look sycophantic while still optimizing for user approval.

Runs

2026-04-01 — Full fact-check of 28 article claims

Mode: Claim · Claims: 28 · Prompt: Unified Research Methodology v1 · Model: Claude Opus 4.6 (1M context)

Comprehensive verification of the claims from the RLHF Yes-Men article series. 25 of 28 claims were confirmed (9 almost certain, 11 very likely, 3 likely). Two claims need correction: C004, whose 84-85% figure is unverifiable, and C008, because DeepSeek was the second most sycophantic model, not the first, and was trained with GRPO, not RLVR.
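
The counts in this run summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the summary; the variable names are illustrative and not part of the methodology. As stated, the confidence buckets sum to 23 rather than 25, and confirmed plus corrected claims sum to 27 rather than 28, so some items are not itemized in this summary.

```python
# Figures copied from the run summary; names are illustrative assumptions.
total_claims = 28
confirmed = 25
needs_correction = 2
confidence_buckets = {"almost certain": 9, "very likely": 11, "likely": 3}

# Sum of the per-confidence counts listed for confirmed claims.
bucket_sum = sum(confidence_buckets.values())

# Claims with a stated outcome (confirmed or needing correction).
accounted = confirmed + needs_correction

print(f"confidence buckets sum to {bucket_sum}; reported confirmed: {confirmed}")
print(f"confirmed + corrections = {accounted}; total claims: {total_claims}")
print(f"confirmed claims without a listed confidence level: {confirmed - bucket_sum}")
print(f"claims without a stated outcome: {total_claims - accounted}")
```

A check of this kind is cheap to rerun whenever the claim list or the verdicts change.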