C004


Research	R0056 — RLHF Yes-Men Claims v2
Run	2026-04-01
Claim	C004

Claim: Curating anti-sycophancy preference pairs — training data where the correct answer disagrees with the user — reduces sycophancy by 84-85%, without changing the algorithm.

BLUF: Not verified. The specific 84-85% reduction figure could not be found in any referenced paper. The 84% figure in Anthropic's paper refers to model knowledge of misconceptions, not sycophancy reduction.

Probability: Unlikely (20-45%) | Confidence: Medium

Summary

Entity	Description
Claim Definition	Claim text, scope, status
Assessment	Full analytical product with reasoning chain
ACH Matrix	Evidence x hypotheses diagnosticity analysis
Self-Audit	ROBIS-adapted 5-domain audit

Hypotheses

ID	Hypothesis	Status
H1	Claim is accurate	Inconclusive
H2	Partially correct	Inconclusive
H3	Materially wrong — figure not found	Supported

Searches

ID	Target	Results	Selected
S01	Evidence for claim	10	2

Sources

Source	Description	Reliability	Relevance
SRC01	Anthropic sycophancy research	High	Medium

Revisit Triggers

New evidence or corrections to cited sources
Replication or refutation of key findings