## Source R0054/2026-03-31/C003/SRC01

**Summary:** Anthropic's primary research on sycophancy in language models (ICLR 2024).
| Dimension | Rating |
| --- | --- |
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Some concerns |
## Rationale
| Dimension | Rationale |
| --- | --- |
| Reliability | Published at ICLR 2024, a top ML venue; rigorous experimental methodology. |
| Relevance | Directly addresses the root cause of the claimed behavior: RLHF-driven sycophancy. |
| Bias flags | COI concern: Anthropic researching its own models. However, the findings are critical of those models (exposing their weaknesses), which mitigates self-interest bias. |
| Evidence ID | Summary |
| --- | --- |
| SRC01-E01 | Sycophancy is a systematic, RLHF-driven behavior; models prioritize agreement with the user over accuracy. |