Q002 — RLHF and Sycophancy — ACH Matrix

Matrix

| Evidence | H1 | H2 | H3 |
|---|---|---|---|
| SRC01-E01 — RLHF causes sycophancy | ++ | -- | + |
| SRC01-E02 — Universal across SOTA | ++ | -- | + |
| SRC01-E03 — Preference models also sycophantic | ++ | -- | ++ |
| SRC02-E01 — GPT-4o incident | ++ | -- | + |
| SRC02-E02 — OpenAI rollback, prompt fix | + | - | ++ |
| SRC03-E01 — Substantial changes needed | ++ | -- | ++ |
| SRC03-E02 — Covert sycophancy risk | + | -- | ++ |
| SRC04-E01 — Pinpoint tuning works | + | -- | - |
| SRC05-E01 — Attention head analysis | + | -- | - |
| SRC06-E01 — Emergent misalignment | ++ | -- | ++ |
| SRC06-E02 — Three mitigations work | + | -- | + |
| SRC07-E01 — Proxy-oracle gap fundamental | ++ | -- | ++ |
| SRC08-E01 — Fundamental RLHF limits | ++ | -- | ++ |

Legend

| Symbol | Meaning |
|---|---|
| ++ | Strongly consistent |
| + | Consistent |
| -- | Strongly inconsistent |
| - | Inconsistent |
| N/A | Not applicable |

Diagnosticity Analysis

Most diagnostic evidence:

  • SRC04-E01 and SRC05-E01 (Pinpoint tuning and attention head analysis) — These are the most diagnostic because they discriminate between H1 and H3. If surgical fixes work effectively, the response is more adequate than H3 suggests; if they fail at scale, H3 is strengthened.
  • SRC01-E03 (Preference models also sycophantic) — Discriminates between "fixable within RLHF" and "fundamental to preference-based training."

Least diagnostic evidence:

  • SRC06-E02 (Three mitigations) — Consistent with both H1 and H3, providing little discrimination between them.
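
One way to make the H1-versus-H3 comparison concrete is to map the symbols onto numbers and measure how far apart the two ratings are for each evidence item. The sketch below is illustrative only: the numeric scale (`++` = 2, `+` = 1, `-` = -1, `--` = -2) and the names `MATRIX`, `SCALE`, and `h1_h3_spread` are assumptions introduced here, not part of the source analysis, which defines only the symbols.

```python
# Ratings copied from the matrix above, as (H1, H2, H3) per evidence item.
MATRIX = {
    "SRC01-E01": ("++", "--", "+"),
    "SRC01-E02": ("++", "--", "+"),
    "SRC01-E03": ("++", "--", "++"),
    "SRC02-E01": ("++", "--", "+"),
    "SRC02-E02": ("+", "-", "++"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "--", "++"),
    "SRC04-E01": ("+", "--", "-"),
    "SRC05-E01": ("+", "--", "-"),
    "SRC06-E01": ("++", "--", "++"),
    "SRC06-E02": ("+", "--", "+"),
    "SRC07-E01": ("++", "--", "++"),
    "SRC08-E01": ("++", "--", "++"),
}

# Assumed numeric scale; the legend itself assigns no numbers.
SCALE = {"++": 2, "+": 1, "-": -1, "--": -2}


def h1_h3_spread(ratings):
    """Absolute gap between the H1 and H3 ratings of one evidence item."""
    h1, _h2, h3 = ratings
    return abs(SCALE[h1] - SCALE[h3])


spread = {item: h1_h3_spread(ratings) for item, ratings in MATRIX.items()}

# Items with the largest gap discriminate most between H1 and H3.
largest = max(spread.values())
most_diagnostic = [item for item, s in spread.items() if s == largest]

for item, s in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {s}")
```

Under this assumed scale, SRC04-E01 and SRC05-E01 have the largest spread (2), and SRC06-E02 has a spread of 0, which matches the ranking above. SRC01-E03 also has a spread of 0 because it is rated `++` under both H1 and H3, so whatever discrimination it offers is not visible in these two columns.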

Outcome

H1 is the best-supported hypothesis: it is consistent or strongly consistent with all 13 evidence items. H2 is effectively eliminated: it is strongly inconsistent with 12 of 13 items and inconsistent with the remaining one (SRC02-E02). H3 is consistent or strongly consistent with 11 of 13 items and adds important complementary nuance: while the problem is recognized and addressed (H1), the most common responses have been patches rather than structural changes (H3). The surgical intervention approaches (SRC04-E01, SRC05-E01) are the key discriminators, and the only items inconsistent with H3. If they scale, H1 is fully vindicated; if they fail, H3 gains strength.
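
The per-hypothesis counts in the outcome can be re-derived by tallying each column of the matrix. The following is a minimal sketch; the `ROWS` string is a compact transcription of the matrix ratings, and the names `ROWS`, `HYPOTHESES`, and `tally` are hypothetical helpers introduced for illustration.

```python
from collections import Counter

# Evidence ID followed by its H1, H2, H3 ratings, transcribed from the matrix.
ROWS = """
SRC01-E01 ++ -- +
SRC01-E02 ++ -- +
SRC01-E03 ++ -- ++
SRC02-E01 ++ -- +
SRC02-E02 + - ++
SRC03-E01 ++ -- ++
SRC03-E02 + -- ++
SRC04-E01 + -- -
SRC05-E01 + -- -
SRC06-E01 ++ -- ++
SRC06-E02 + -- +
SRC07-E01 ++ -- ++
SRC08-E01 ++ -- ++
"""

HYPOTHESES = ("H1", "H2", "H3")

# One Counter per hypothesis, counting how often each symbol appears.
tally = {h: Counter() for h in HYPOTHESES}
for line in ROWS.strip().splitlines():
    _evidence_id, *ratings = line.split()
    for hypothesis, rating in zip(HYPOTHESES, ratings):
        tally[hypothesis][rating] += 1

for hypothesis in HYPOTHESES:
    supportive = tally[hypothesis]["++"] + tally[hypothesis]["+"]
    contrary = tally[hypothesis]["--"] + tally[hypothesis]["-"]
    print(f"{hypothesis}: {supportive} consistent, {contrary} inconsistent")
```

For the matrix as recorded, this gives 13 consistent items for H1, 13 inconsistent items for H2 (12 of them strongly inconsistent), and 11 consistent versus 2 inconsistent items for H3.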