R0055/2026-04-01/C002/SRC01
Anthropic/ICLR RLHF study
Source
| Field | Value |
|---|---|
| Title | Towards Understanding Sycophancy in Language Models |
| Publisher | Various |
| Author(s) | Various |
| Date | 2024-2026 |
| URL | https://arxiv.org/pdf/2310.13548 |
| Type | Research paper |
Summary
| Dimension | Rating |
|---|---|
| Reliability | High |
| Relevance | High |
| Bias: Missing data | Low risk |
| Bias: Measurement | Low risk |
| Bias: Selective reporting | Low risk |
| Bias: Randomization | N/A (not an RCT) |
| Bias: Protocol deviation | N/A (not an RCT) |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
|---|---|
| Reliability | High: research paper from an established source (Anthropic; ICLR) |
| Relevance | High: directly addresses the claim |
| Bias flags | No significant bias concerns identified |

| Evidence ID | Summary |
|---|---|
| SRC01-E01 | RLHF pipeline described: human labelers express preferences used to train reward models |
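
To make SRC01-E01 concrete, the step "preferences used to train reward models" can be sketched as a pairwise loss. This is a minimal illustration that assumes a Bradley-Terry preference model, which is a common choice but is not stated in the evidence summary above. The function name `preference_loss` and the scalar reward inputs are hypothetical and are not taken from the source.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one labeled comparison (illustrative only).

    Assumes a Bradley-Terry model: the probability that the labeler-preferred
    response wins is sigmoid(reward_chosen - reward_rejected). The loss is the
    negative log of that probability.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model already agrees with the labeler
# and large when it ranks the rejected response higher.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
```

Under this assumption, minimizing the loss over many labeled pairs pushes the reward model to score whatever labelers prefer more highly, so any systematic labeler preference is carried into the reward signal.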