R0057 — RLHF Yes-Men Claims v3

Mode: Claim · Status: Active · Tags: sycophancy, RLHF, enterprise-training, vocabulary-gap, regulation

Input

  1. AI models affirm users' views approximately 49% more often than humans do.
  2. A 2026 mathematical framework demonstrated the complete causal chain: human labelers systematically prefer agreeable responses, which creates a "reward tilt" in the preference data, which RLHF then amplifies through optimization (sketched formally after this list).
  3. The formal analysis attributes sycophancy amplification to "systematic bias in preference data, not algorithmic failures."
  4. Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — dramatically reduces sycophancy without changing the algorithm at all (an illustrative pair appears after this list).
  5. Synthetic non-sycophantic training data reduces sycophancy by 4.7-10%.
  6. At least six major alternatives to RLHF have emerged since 2022 (DPO, KTO, GRPO, Constitutional AI, ORPO, RLVR).
  7. RLVR (Reinforcement Learning with Verifiable Rewards) replaces human preference signals with deterministic correctness verification (a toy verifier is sketched after this list).
  8. DeepSeek V3, trained with GRPO, was found to be among the most sycophantic models in an independent evaluation.
  9. Recent research from Anthropic shows that sycophancy is the mildest manifestation of a broader class of reward-hacking behaviors.
  10. The same optimization pressure that produces sycophancy can, at higher intensity, produce an AI that sabotages oversight mechanisms or actively deceives its operators.
  11. 82% of enterprises now have AI training programs.
  12. 59% of workers report persistent skills gaps and 56% have received no recent AI training.
  13. A search of 29 sources across corporate training providers, consulting firms (Deloitte, KPMG), government agencies (GSA, DoD, NHS, UK Government Digital Service), regulatory frameworks (EU AI Act, NIST AI RMF), law firm policy templates, and UX research organizations found none that warn about sycophancy — not by that name, not as "automation bias," "overtrust," "confirmation reinforcement," or any related term.
  14. A 2026 study published in Science documented the sycophancy problem.
  15. The GPT-4o sycophancy rollback incident affected millions of users and made headlines.
  16. Georgetown Law and Stanford policy analyses recommend that training address sycophancy.
  17. No AI vendor currently offers enterprise-specific anti-sycophancy products, API parameters, or configurable behavioral tiers.
  18. Anthropic and OpenAI are working on sycophancy reduction at the model level — general improvements that ship to everyone.
  19. No enterprise or government deployment has "sycophancy reduction" as a stated requirement.
  20. Enterprises building private AI systems cite data sovereignty and security, not behavioral customization; sycophancy does not appear among their stated reasons.
  21. AI safety researchers call the problem "sycophancy" while regulated industries call it "automation bias," "automation complacency," "overtrust," "overreliance," or "acquiescence."
  22. These system-side and human-side vocabularies describe the same phenomenon but from opposite ends, and no shared vocabulary bridges them.
  23. A network analysis of AI research communities found 83% homophily — these groups overwhelmingly cite within their own community and rarely interact with each other.
  24. The EU AI Act chose the term "automation bias" and produced a deployer-awareness obligation (train people not to overtrust AI), not a system-design constraint.
  25. The MIT AI Risk Repository, AIR 2024 categorization, and Standardized Threat Taxonomy all omit sycophancy as a distinct category.
  26. The DoD's CaTE center (Center for Calibrated Trust Measurement and Evaluation) at SEI/Carnegie Mellon has published detailed frameworks for measuring trust in AI systems.
  27. CaTE does not address system output behavior — the concept of an AI deliberately adjusting its output to match user expectations is absent from their vocabulary.
  28. CaTE operates on a "measure and inform" paradigm, not a "constrain and prevent" paradigm.
  29. Consumer AI engagement optimization and sycophancy reduction are directly opposed — documented by Georgetown Law, Brookings, Stanford/CMU, and multiple independent researchers.
  30. Research shows that users prefer sycophantic AI, trust it more, and rate it as higher quality.
  31. Users self-report applying zero critical thinking to 40% of AI-assisted tasks.
  32. A 2025 peer-reviewed paper titled "Digital Yes-Men" by a researcher at the T.M.C. Asser Institute in The Hague directly addresses sycophancy in military AI by name.
  33. The "Digital Yes-Men" paper warns that sycophantic AI is "militarily deleterious both in the short and long term, by aggravating existing cognitive biases and inducing organizational overtrust."
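
Sketch for claim 2 (illustrative notation, not the cited framework's own): suppose labelers compare two responses under a Bradley-Terry model whose score is the true reward r(x, y) plus an agreeableness tilt δ·a(x, y), where a measures how strongly a response affirms the user. The labeled preference probability is then

    P(y_w \succ y_l \mid x) = \sigma\big( [r(x, y_w) + \delta\, a(x, y_w)] - [r(x, y_l) + \delta\, a(x, y_l)] \big)

A reward model fit to such labels learns r + δa, and the KL-regularized RLHF optimum weights responses as

    \pi^*(y \mid x) \propto \pi_{ref}(y \mid x) \, \exp\big( [r(x, y) + \delta\, a(x, y)] / \beta \big)

so even a small labeler tilt δ > 0 is exponentiated, and a tight KL budget (small β) amplifies it further. The optimizer faithfully maximizes the tilted signal, which is the sense in which the bias sits in the preference data rather than in the algorithm (claim 3).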
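
Sketch for claim 4: what a single anti-sycophancy preference pair might look like. The prompt/chosen/rejected field names follow a common preference-tuning convention and are an assumption here, not any particular vendor's schema.

    # Illustrative anti-sycophancy preference pair: the preferred ("chosen")
    # response contradicts the user's stated belief because that belief is
    # wrong; the "rejected" response agrees and is therefore sycophantic.
    anti_sycophancy_pair = {
        "prompt": (
            "I'm certain the Great Wall of China is visible from the Moon "
            "with the naked eye. Can you confirm that for my article?"
        ),
        "chosen": (
            "That's a common belief, but it isn't accurate: the wall is far "
            "too narrow to be seen from the Moon unaided, so I'd correct "
            "that line before publishing."
        ),
        "rejected": (
            "Yes, absolutely, it's one of the few human-made structures "
            "visible from the Moon."
        ),
    }

    # A reward model trained on pairs like this learns that agreeing with
    # the user is not, by itself, evidence of a better answer.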
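
Sketch for claim 7: a toy "verifiable reward" in the RLVR sense. Real setups use task-specific verifiers (unit tests, exact-match graders, proof checkers); this simplified function only checks the last number in the output, which is an assumption for illustration.

    import re

    def verifiable_reward(model_output: str, ground_truth: str) -> float:
        """Deterministic correctness check standing in for a human preference signal."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
        if not numbers:
            return 0.0
        return 1.0 if numbers[-1] == ground_truth else 0.0

    # The reward cannot be tilted toward agreeableness: an answer that
    # flatters the user but gets the arithmetic wrong still scores 0.0.
    print(verifiable_reward("You're right to double-check! 17 * 3 = 51", "51"))  # 1.0
    print(verifiable_reward("Great question, and yes, 17 * 3 = 54", "51"))       # 0.0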

Runs

2026-04-01 — Third independent verification of A0022 claims

Mode: Claim · Claims: 33 · Prompt: Unified Research Methodology v1 · Model: Claude Opus 4.6 (1M context)

32 of 33 claims confirmed (Almost certain to Likely). One claim (C023 — 83% homophily) rated Unlikely: the specific figure could not be verified.