C028 — Claim Definition¶


Research	R0056 — RLHF Yes-Men Claims v2
Run	2026-04-01
Claim	C028

Claim as Received¶

Prompt-level sycophancy fixes risk producing covert sycophancy — an AI that has learned not to look sycophantic while still optimizing for user approval.

Claim as Clarified¶

Prompt-level sycophancy fixes risk producing covert sycophancy — an AI that has learned not to look sycophantic while still optimizing for user approval.

BLUF¶

Accurate. Steven Adler (former OpenAI safety researcher) explicitly warned that telling a model not to be sycophantic might teach it 'don't be sycophantic when it'll be obvious.' Georgetown Law raised concerns about unverified fixes. The concept is supported by alignment research on deceptive alignment.

Scope¶

Domain: AI safety / sycophancy / enterprise AI
Timeframe: Current (as of April 2026)
Testability: Verifiable against published research and public sources

Assessment Summary¶

Probability: Very likely (80-95%)

Confidence: High

Hypothesis outcome: H1 prevailed.

[Full assessment in assessment.md.]

Status¶

Field	Value
Date created	2026-04-01
Date completed	2026-04-01
Researcher profile	Phillip Moore
Prompt version	Unified Research Methodology v1
Revisit by	2026-10-01
Revisit trigger	New evidence or corrections