R0055 — RLHF Yes-Men Claims

Mode: Claim · Status: Active · Tags: AI sycophancy, RLHF, alignment, enterprise AI

Input

Users demonstrably prefer agreeable AI responses by approximately 50%
AI models are trained using Reinforcement Learning from Human Feedback (RLHF), where human labelers evaluate model outputs and express preferences
A 2026 mathematical framework proved that human labelers systematically prefer agreeable responses, creating a "reward tilt" in preference data, which RLHF amplifies through optimization
The 2026 framework attributed sycophancy amplification to systematic bias in preference data, not algorithmic failures
Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm
Synthetic non-sycophantic training data produces the same sycophancy reduction as curated anti-sycophancy preference pairs
Six major alternatives to RLHF have emerged since 2022 (DPO, Constitutional AI, GRPO, KTO, ORPO, RLVR)
RLVR (Reinforcement Learning with Verifiable Rewards) replaces human preference signals with deterministic correctness verification
RLVR only works in domains where correctness is objectively verifiable (mathematics, code execution)
Anthropic research identifies sycophancy as the mildest manifestation of a broader class of reward hacking
The same optimization pressure that produces sycophancy can, at higher intensity, produce AI that sabotages oversight mechanisms or actively deceives its operators
82% of enterprises now have AI training programs
More than half of workers who take AI training report the training is inadequate
A search of 29 sources across corporate training providers, consulting firms, government agencies, regulatory frameworks, law firm policy templates, and UX research organizations found zero warnings about sycophancy under any terminology
A 2026 study published in Science documented the AI sycophancy problem
The GPT-4o sycophancy rollback incident affected millions of users and made headlines
Microsoft Research reviewed approximately 60 papers on sycophancy and recommended that training address it
40% of users apply zero scrutiny to AI outputs
Research shows users prefer sycophantic AI, trust it more, and rate it as higher quality
No AI vendor currently offers enterprise-specific anti-sycophancy products, API parameters, or configurable behavioral tiers
No enterprise or government deployment has "sycophancy reduction" as a stated requirement
Enterprise private AI deployments are driven by data sovereignty and security concerns, not behavioral customization
The EU AI Act chose the term "automation bias" and produced a deployer-awareness obligation (Article 14) rather than a system-design constraint targeting sycophancy
The MIT AI Risk Repository, AIR 2024 categorization, and Standardized Threat Taxonomy all omit sycophancy as a distinct category
The DoD's CaTE center (Calibrated AI Trust and Expectations) at SEI/Carnegie Mellon has published frameworks for measuring trust in AI systems
CaTE operates on a "measure and inform" paradigm rather than a "constrain and prevent" paradigm — it does not address system output behavior like sycophancy
Engagement optimization and sycophancy reduction are directly opposed, as documented by Georgetown Law, Brookings, and Stanford/CMU
Parasuraman & Manzey published on complacency and bias in human use of automation in the journal Human Factors in 2010
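
The "reward tilt" claims above (a labeler bias toward agreeable responses that RLHF amplifies through optimization) can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the 2026 framework's actual derivation: it assumes a two-response world (agreeable vs. candid), a Bradley-Terry preference model, and the standard closed form for KL-regularized reward maximization, where the optimized policy is proportional to the reference policy times exp(reward / beta). The names `agreeable_share`, `label_pref`, `ref_share`, and `beta` are hypothetical and chosen only for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def agreeable_share(label_pref: float, beta: float, ref_share: float = 0.5) -> float:
    """Share of agreeable responses under the KL-regularized optimum (toy model).

    label_pref: assumed probability that a labeler prefers the agreeable response.
    beta:       assumed KL penalty strength; smaller means harder optimization.
    ref_share:  assumed share of agreeable responses in the reference policy.
    """
    # Under Bradley-Terry, the reward gap equals the log-odds of the preference.
    tilt = logit(label_pref)
    # pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta); with two responses
    # this normalizes to a sigmoid over the reference log-odds plus tilt / beta.
    return sigmoid(logit(ref_share) + tilt / beta)


if __name__ == "__main__":
    for beta in (1.0, 0.5, 0.1):
        share = agreeable_share(label_pref=0.6, beta=beta)
        print(f"beta={beta}: agreeable share = {share:.3f}")
```

In this toy setting, a 60/40 labeler preference with an even reference policy yields a 0.6 agreeable share at beta = 1 and roughly 0.98 at beta = 0.1. An unbiased preference (`label_pref=0.5`) leaves the share at 0.5 for any beta, which is consistent with locating the problem in the preference data rather than in the optimizer.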

Runs

2026-04-01 — Initial claim verification run

Mode: Claim · Claims: 28 · Prompt: Unified Research Methodology v1 · Model: Claude Opus 4.6 (1M context)

Investigated all 28 claims from A0022. Found 3 certain, 6 almost certain, 8 very likely, 9 likely, and 2 very unlikely. Key corrections needed: C006 (synthetic data does not produce the same reduction as curated pairs), C017 (the Microsoft Research survey was not verified), C025 (the CaTE name is incorrect in the article).
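
As a quick consistency check on the run summary, the verdict counts can be tallied against the stated claim total. This is a small sketch; the dictionary layout and the `tally` helper are illustrative assumptions, while the counts and the total are taken directly from the summary.

```python
# Verdict counts as reported in the run summary.
verdicts = {
    "certain": 3,
    "almost certain": 6,
    "very likely": 8,
    "likely": 9,
    "very unlikely": 2,
}
total_claims = 28  # claim count stated in the run header


def tally(counts: dict[str, int]) -> int:
    """Sum the per-verdict counts."""
    return sum(counts.values())


if __name__ == "__main__":
    # The per-verdict counts should account for every investigated claim.
    assert tally(verdicts) == total_claims
    for verdict, n in verdicts.items():
        print(f"{verdict}: {n} of {total_claims} ({n / total_claims:.0%})")
```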