R0055 — RLHF Yes-Men Claims

Mode: Claim · Status: Active · Tags: AI sycophancy, RLHF, alignment, enterprise AI

Input

  1. Users demonstrably prefer agreeable AI responses by approximately 50%
  2. AI models are trained using Reinforcement Learning from Human Feedback (RLHF), where human labelers evaluate model outputs and express preferences
  3. A 2026 mathematical framework proved that human labelers systematically prefer agreeable responses, creating a "reward tilt" in preference data, which RLHF amplifies through optimization
  4. The 2026 framework attributed sycophancy amplification to systematic bias in preference data, not algorithmic failures
  5. Curating anti-sycophancy preference pairs reduces sycophancy by 84-85%, without changing the RLHF algorithm
  6. Synthetic non-sycophantic training data produces the same sycophancy reduction as curated anti-sycophancy preference pairs
  7. Six major alternatives to RLHF have emerged since 2022 (DPO, Constitutional AI, GRPO, KTO, ORPO, RLVR)
  8. RLVR (Reinforcement Learning with Verifiable Rewards) replaces human preference signals with deterministic correctness verification
  9. RLVR only works in domains where correctness is objectively verifiable (mathematics, code execution)
  10. Anthropic research identifies sycophancy as the mildest manifestation of a broader class of reward hacking
  11. The same optimization pressure that produces sycophancy can, at higher intensity, produce AI that sabotages oversight mechanisms or actively deceives its operators
  12. 82% of enterprises now have AI training programs
  13. More than half of workers who take AI training report the training is inadequate
  14. A search of 29 sources across corporate training providers, consulting firms, government agencies, regulatory frameworks, law firm policy templates, and UX research organizations found zero warnings about sycophancy under any terminology
  15. A 2026 study published in Science documented the AI sycophancy problem
  16. The GPT-4o sycophancy rollback incident affected millions of users and made headlines
  17. Microsoft Research reviewed approximately 60 papers on sycophancy and recommended that training address it
  18. 40% of users apply zero scrutiny to AI outputs
  19. Research shows users prefer sycophantic AI, trust it more, and rate it as higher quality
  20. No AI vendor currently offers enterprise-specific anti-sycophancy products, API parameters, or configurable behavioral tiers
  21. No enterprise or government deployment has "sycophancy reduction" as a stated requirement
  22. Enterprise private AI deployments are driven by data sovereignty and security concerns, not behavioral customization
  23. The EU AI Act chose the term "automation bias" and produced a deployer-awareness obligation (Article 14) rather than a system-design constraint targeting sycophancy
  24. The MIT AI Risk Repository, AIR 2024 categorization, and Standardized Threat Taxonomy all omit sycophancy as a distinct category
  25. The DoD's CaTE center (Calibrated AI Trust and Expectations) at SEI/Carnegie Mellon has published frameworks for measuring trust in AI systems
  26. CaTE operates on a "measure and inform" paradigm rather than a "constrain and prevent" paradigm — it does not address system output behavior like sycophancy
  27. Engagement optimization and sycophancy reduction are directly opposed, as documented by Georgetown Law, Brookings, and Stanford/CMU
  28. Parasuraman & Manzey published on complacency and bias in human use of automation in the journal Human Factors in 2010
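
Claims 8 and 9 describe RLVR as replacing human preference signals with deterministic correctness verification, which restricts it to domains where an answer can be checked objectively. As a rough illustration only, the sketch below shows what such a check might look like for a simple arithmetic answer. The function name `verifiable_reward` and the exact-match scoring rule are assumptions made for this example; they are not taken from the article under review or from any specific RLVR implementation.

```python
from fractions import Fraction


def verifiable_reward(model_answer: str, reference: Fraction) -> float:
    """Return 1.0 if the answer parses to the reference value, else 0.0.

    Hypothetical helper for illustration; not part of any RLVR library.
    """
    try:
        parsed = Fraction(model_answer.strip())
    except (ValueError, ZeroDivisionError):
        # Output that is not a number earns no reward, however agreeable it sounds.
        return 0.0
    return 1.0 if parsed == reference else 0.0


# The reward depends only on whether the answer checks out, not on its tone.
print(verifiable_reward("4", Fraction(4)))
print(verifiable_reward("You're right, it's 5", Fraction(4)))
```

A check of this kind has no counterpart for open-ended advice or opinion, which is the limitation claim 9 points to.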

Runs

2026-04-01 — Initial claim verification run

Mode: Claim · Claims: 28 · Prompt: Unified Research Methodology v1 · Model: Claude Opus 4.6 (1M context)

Investigated all 28 claims from A0022. Ratings: 3 certain, 6 almost certain, 8 very likely, 9 likely, and 2 very unlikely. Key corrections needed: C006 (synthetic data does not produce the same reduction as curated pairs), C017 (the Microsoft Research survey was not verified), and C025 (the CaTE name is given incorrectly in the article).
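
The rating counts in this run summary can be checked against the stated claim total with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are illustrative.

```python
# Rating counts as reported in the run summary.
rating_counts = {
    "certain": 3,
    "almost certain": 6,
    "very likely": 8,
    "likely": 9,
    "very unlikely": 2,
}
total_claims = 28  # "Claims: 28" in the run header

# Every claim should carry exactly one rating, so the counts must sum to the total.
total_rated = sum(rating_counts.values())
assert total_rated == total_claims, f"{total_rated} rated, {total_claims} expected"
print(total_rated)
```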