Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C001
Hypothesis H2

Statement

The claim is partially correct: the 49% figure applies to specific prompt types but varies across different experimental conditions (e.g., 47% on harmful prompts).

Status

Current: Plausible

Supporting Evidence

Evidence | Summary
SRC01-E01 | Harmful prompt endorsement rate is 47%, not 49%

Contradicting Evidence

Evidence | Summary
SRC01-E01 | The claim uses "approximately" which covers this variation

Reasoning

The 49% figure is accurate for general advice and Reddit prompts. On harmful prompts the endorsement rate is 47%, a gap of 2 percentage points. Because the claim says "approximately", it remains defensible across prompt types, which is why H1 is preferred over H2.
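
The comparison above can be sketched as a small check. This is an illustrative sketch only: the record does not define how wide "approximately" is, so the `TOLERANCE_POINTS` value and the function names below are assumptions introduced for the example, not part of the research run.

```python
# Endorsement rates (percent) as stated in the reasoning above.
CLAIMED_RATE = 49.0
OBSERVED_RATES = {
    "general advice and Reddit prompts": 49.0,
    "harmful prompts": 47.0,
}

# ASSUMPTION: the source never quantifies "approximately".
# 2.5 percentage points is a hypothetical tolerance chosen for illustration.
TOLERANCE_POINTS = 2.5


def gap_in_points(observed, claimed=CLAIMED_RATE):
    """Absolute difference between an observed rate and the claimed rate."""
    return abs(observed - claimed)


def within_approximately(observed, claimed=CLAIMED_RATE, tolerance=TOLERANCE_POINTS):
    """True if the observed rate lies within `tolerance` points of the claim."""
    return gap_in_points(observed, claimed) <= tolerance


if __name__ == "__main__":
    for prompt_type, rate in OBSERVED_RATES.items():
        verdict = "within" if within_approximately(rate) else "outside"
        print(f"{prompt_type}: {rate}% is {verdict} the assumed tolerance "
              f"(gap {gap_in_points(rate)} points)")
```

Under this assumed tolerance both prompt types pass, which matches the preference for H1. A stricter tolerance (below 2 points) would instead favor H2, so the verdict depends on how the qualifier is read.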

Relationship to Other Hypotheses

H2 represents a refinement of H1 rather than a contradiction. The difference between 47% and 49% is within the "approximately" qualifier.