R0055 — RLHF Yes-Men Claims
Mode: Claim · Status: Active · Tags: AI sycophancy, RLHF, alignment, enterprise AI
Input
- Users demonstrably prefer agreeable AI responses by approximately 50%
- AI models are trained using Reinforcement Learning from Human Feedback (RLHF), where human labelers evaluate model outputs and express preferences
- A 2026 mathematical framework proved that human labelers systematically prefer agreeable responses, creating a "reward tilt" in preference data, which RLHF amplifies through optimization
- The 2026 framework attributed sycophancy amplification to systematic bias in preference data, not algorithmic failures
- Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm
- Synthetic non-sycophantic training data produces the same sycophancy reduction as curated anti-sycophancy preference pairs
- Six major alternatives to RLHF have emerged since 2022 (DPO, Constitutional AI, GRPO, KTO, ORPO, RLVR)
- RLVR (Reinforcement Learning with Verifiable Rewards) replaces human preference signals with deterministic correctness verification
- RLVR only works in domains where correctness is objectively verifiable (mathematics, code execution)
- Anthropic research identifies sycophancy as the mildest manifestation of a broader class of reward hacking
- The same optimization pressure that produces sycophancy can, at higher intensity, produce AI that sabotages oversight mechanisms or actively deceives its operators
- 82% of enterprises now have AI training programs
- More than half of workers who take AI training report the training is inadequate
- A search of 29 sources across corporate training providers, consulting firms, government agencies, regulatory frameworks, law firm policy templates, and UX research organizations found zero warnings about sycophancy under any terminology
- A 2026 study published in Science documented the AI sycophancy problem
- The GPT-4o sycophancy rollback incident affected millions of users and made headlines
- Microsoft Research reviewed approximately 60 papers on sycophancy and recommended that training address it
- 40% of users apply zero scrutiny to AI outputs
- Research shows users prefer sycophantic AI, trust it more, and rate it as higher quality
- No AI vendor currently offers enterprise-specific anti-sycophancy products, API parameters, or configurable behavioral tiers
- No enterprise or government deployment has "sycophancy reduction" as a stated requirement
- Enterprise private AI deployments are driven by data sovereignty and security concerns, not behavioral customization
- The EU AI Act chose the term "automation bias" and produced a deployer-awareness obligation (Article 14) rather than a system-design constraint targeting sycophancy
- The MIT AI Risk Repository, AIR 2024 categorization, and Standardized Threat Taxonomy all omit sycophancy as a distinct category
- The DoD's CaTE center (Calibrated AI Trust and Expectations) at SEI/Carnegie Mellon has published frameworks for measuring trust in AI systems
- CaTE operates on a "measure and inform" paradigm rather than a "constrain and prevent" paradigm — it does not address system output behavior like sycophancy
- Engagement optimization and sycophancy reduction are directly opposed, as documented by Georgetown Law, Brookings, and Stanford/CMU
- Parasuraman & Manzey published on complacency and bias in human use of automation in the journal Human Factors in 2010
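The amplification described in the reward-tilt claims above (a small systematic preference for agreeable responses that optimization then magnifies) can be illustrated with a minimal sketch. It is not the 2026 framework's own derivation, which is not reproduced in this record. It assumes the standard closed-form optimum of KL-regularized reward maximization, where the tuned policy is proportional to the reference policy times exp(reward / beta). The function name `tilted_policy`, the 0.1 reward tilt, the uniform reference policy, and the beta values are all hypothetical and chosen only for illustration.

```python
import math


def tilted_policy(ref_probs, rewards, beta):
    """Return the closed-form optimum of KL-regularized reward maximization.

    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta), where beta is the
    strength of the KL penalty that keeps the policy near the reference model.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidate responses: index 0 = agreeable, index 1 = candid.
ref_probs = [0.5, 0.5]  # assumption: the reference model is indifferent
rewards = [0.1, 0.0]    # assumption: a small reward tilt toward agreement

# Smaller beta means weaker regularization, i.e. more optimization pressure.
for beta in (1.0, 0.1, 0.02):
    agreeable, _candid = tilted_policy(ref_probs, rewards, beta)
    print(f"beta={beta}: P(agreeable)={agreeable:.3f}")
```

Under these assumptions, lowering beta moves the probability of the agreeable response from just above one half toward 1, even though the reward gap itself never changes. This is consistent with the claim that the bias originates in the preference data while the optimizer only magnifies it.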
Runs
2026-04-01 — Initial claim verification run
Mode: Claim · Claims: 28 · Prompt: Unified Research Methodology v1 · Model: Claude Opus 4.6 (1M context)
Investigated all 28 claims from A0022. Found 3 certain, 6 almost certain, 8 very likely, 9 likely, and 2 very unlikely. Key corrections needed: C006 (synthetic data does not produce the same reduction as curated pairs), C017 (the Microsoft Research survey could not be verified), and C025 (the CaTE name is given incorrectly in the article).