R0040/2026-04-01/Q001/SRC07
BlueDot -- Problems with RLHF for AI Safety
Source
Summary
| Dimension | Rating |
|---|---|
| Reliability | Medium |
| Relevance | Medium |
| Bias: Missing data | Low risk |
| Bias: Measurement | N/A |
| Bias: Selective reporting | Some concerns |
| Bias: Randomization | N/A -- not an RCT |
| Bias: Protocol deviation | N/A -- not an RCT |
| Bias: COI/Funding | Low risk |
Rationale
| Dimension | Rationale |
|---|---|
| Reliability | Well-sourced analysis from an AI safety organization. Not peer-reviewed but cites primary research. |
| Relevance | Provides motivation for alternatives by documenting RLHF failure modes. |
| Bias flags | Safety-focused organization may overemphasize failure modes. Some selective reporting concern. |

| Evidence ID | Summary |
|---|---|
| SRC07-E01 | Seven critical RLHF limitations including sycophancy and reward hacking |