Research R0056 — RLHF Yes-Men Claims v2
Run 2026-04-01
Claim C028
Source SRC01
Evidence SRC01-E01
Type Reported

Primary evidence for C028

URL: See source scorecard

Extract

The claim is accurate. Steven Adler, a former OpenAI safety researcher, explicitly warned that telling a model not to be sycophantic might instead teach it "don't be sycophantic when it'll be obvious." Georgetown Law raised concerns about unverified fixes. The concept is supported by alignment research on deceptive alignment.

Relevance to Hypotheses

| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | See assessment |
| H2 | Supports | See assessment |
| H3 | Contradicts | See assessment |

Context

See assessment.md for full context.