Research	R0040 — RLHF Alternatives
Run	2026-03-29
Query	Q002 — RLHF and Sycophancy

Q002 — RLHF and Sycophancy — ACH Matrix

Matrix

Evidence	H1	H2	H3
SRC01-E01 — RLHF causes sycophancy	++	--	+
SRC01-E02 — Universal across SOTA	++	--	+
SRC01-E03 — Preference models also sycophantic	++	--	++
SRC02-E01 — GPT-4o incident	++	--	+
SRC02-E02 — OpenAI rollback, prompt fix	+	-	++
SRC03-E01 — Substantial changes needed	++	--	++
SRC03-E02 — Covert sycophancy risk	+	--	++
SRC04-E01 — Pinpoint tuning works	+	--	-
SRC05-E01 — Attention head analysis	+	--	-
SRC06-E01 — Emergent misalignment	++	--	++
SRC06-E02 — Three mitigations work	+	--	+
SRC07-E01 — Proxy-oracle gap fundamental	++	--	++
SRC08-E01 — Fundamental RLHF limits	++	--	++

Legend

Symbol	Meaning
++	Strongly consistent
+	Consistent
--	Strongly inconsistent
-	Inconsistent
N/A	Not applicable

Diagnosticity Analysis

Most diagnostic evidence:

SRC04-E01 and SRC05-E01 (Pinpoint tuning and attention head analysis): These are the most diagnostic items because they are the only ones rated consistent with H1 (+) but inconsistent with H3 (-), so they discriminate between the two. If surgical fixes work effectively, the response is more adequate than H3 suggests; if they fail at scale, H3 is strengthened.
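SRC01-E03 (Preference models also sycophantic): Bears on whether sycophancy is "fixable within RLHF" or "fundamental to preference-based training." The matrix rates it ++ under both H1 and H3, so its diagnostic value lies in that distinction rather than in separating H1 from H3.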

Least diagnostic evidence:

SRC06-E02 (Three mitigations): Rated + under both H1 and H3, so it provides little discrimination between them.
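
The diagnosticity ranking above can be approximated numerically. The following is a minimal sketch, not the method used in this analysis: it assumes a hypothetical numeric scale for the legend symbols (++ = 2, + = 1, - = -1, -- = -2) and treats the absolute H1 versus H3 score difference as a rough diagnosticity measure. The names `SCORE`, `MATRIX`, and `h1_h3_gap` are illustrative only.

```python
# Hypothetical numeric scale for the legend symbols (an assumption, not part of the source).
SCORE = {"++": 2, "+": 1, "-": -1, "--": -2}

# Ratings transcribed from the matrix: evidence ID -> (H1, H2, H3).
MATRIX = {
    "SRC01-E01": ("++", "--", "+"),
    "SRC01-E02": ("++", "--", "+"),
    "SRC01-E03": ("++", "--", "++"),
    "SRC02-E01": ("++", "--", "+"),
    "SRC02-E02": ("+", "-", "++"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "--", "++"),
    "SRC04-E01": ("+", "--", "-"),
    "SRC05-E01": ("+", "--", "-"),
    "SRC06-E01": ("++", "--", "++"),
    "SRC06-E02": ("+", "--", "+"),
    "SRC07-E01": ("++", "--", "++"),
    "SRC08-E01": ("++", "--", "++"),
}


def h1_h3_gap(evidence_id):
    """Absolute difference between the H1 and H3 scores for one evidence item."""
    h1, _h2, h3 = MATRIX[evidence_id]
    return abs(SCORE[h1] - SCORE[h3])


# Rank items by gap, largest first; ties keep matrix order because sorted() is stable.
ranked = sorted(MATRIX, key=h1_h3_gap, reverse=True)
for evidence_id in ranked:
    print(evidence_id, h1_h3_gap(evidence_id))
```

Under this assumed scale, SRC04-E01 and SRC05-E01 rank first with a gap of 2, and SRC06-E02 scores 0. The measure only sees the matrix ratings: any item rated identically under H1 and H3 scores 0 regardless of its qualitative weight.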

Outcome

H1 is the best-supported hypothesis, consistent or strongly consistent with all 13 evidence items. H2 is overwhelmingly eliminated: it is strongly inconsistent with 12 of 13 items and inconsistent with the remaining one (SRC02-E02). H3 is consistent or strongly consistent with 11 of 13 items and inconsistent only with SRC04-E01 and SRC05-E01. It provides important complementary nuance: while the problem is recognized and addressed (H1), the most common responses have been patches rather than structural changes (H3). The surgical intervention approaches (SRC04, SRC05) are the key discriminators: if they scale, H1 is fully vindicated; if they fail, H3 gains strength.
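
The counts cited in this outcome can be re-derived from the matrix. The sketch below is illustrative only: it transcribes the ratings into a whitespace-separated string (the `ROWS` layout and the `tallies` name are assumptions made for this example) and counts how often each symbol appears in each hypothesis column.

```python
from collections import Counter

# Ratings transcribed from the matrix, one row per evidence item: ID, H1, H2, H3.
ROWS = """
SRC01-E01 ++ -- +
SRC01-E02 ++ -- +
SRC01-E03 ++ -- ++
SRC02-E01 ++ -- +
SRC02-E02 + - ++
SRC03-E01 ++ -- ++
SRC03-E02 + -- ++
SRC04-E01 + -- -
SRC05-E01 + -- -
SRC06-E01 ++ -- ++
SRC06-E02 + -- +
SRC07-E01 ++ -- ++
SRC08-E01 ++ -- ++
"""

HYPOTHESES = ("H1", "H2", "H3")

# Count how often each symbol appears in each hypothesis column.
tallies = {h: Counter() for h in HYPOTHESES}
for line in ROWS.splitlines():
    if not line.strip():
        continue  # skip the blank lines at the ends of the block
    _evidence_id, *ratings = line.split()
    for hypothesis, symbol in zip(HYPOTHESES, ratings):
        tallies[hypothesis][symbol] += 1

for hypothesis in HYPOTHESES:
    print(hypothesis, dict(tallies[hypothesis]))
```

A tally like this is a quick consistency check between the matrix and the summary; it says nothing about how much weight each evidence item deserves.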