Q002 — ACH Matrix


Research	R0040 — RLHF Alternatives
Run	2026-03-28
Query	Q002

Matrix

Evidence	H1: RLHF is primary cause, driving change	H2: Not attributed to RLHF	H3: One factor, multi-pronged response
SRC01-E01: RLHF models exhibit sycophancy "driven in part" by preferences	+	--	++
SRC02-E01: Causal chain: data bias -> reward tilt -> amplification	++	--	++
SRC03-E01: Four-cause taxonomy; multi-faceted mitigation needed	-	-	++
SRC04-E01: GPT-4o sycophancy from RLHF reward signal imbalance	++	--	+
SRC05-E01: DPO + anti-sycophancy data: 84-85% reduction	+	--	++
SRC06-E01: Synthetic data reduces sycophancy without algorithm change	N/A	-	++

Legend:

Symbol	Meaning
++	Strongly supports
+	Supports
--	Strongly contradicts
-	Contradicts
N/A	Not applicable to this hypothesis
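A common way to read an ACH matrix is to count, for each hypothesis, how many pieces of evidence contradict it; the hypothesis with the fewest contradictions is the hardest to reject. The sketch below applies that count to the ratings in the matrix above. The numeric scale and the names `SCALE`, `MATRIX`, and `count_inconsistencies` are assumptions introduced for illustration, not part of the original analysis.

```python
# Ratings copied from the matrix, in the order (H1, H2, H3).
# The numeric scale is an assumption for illustration only.
SCALE = {"++": 2, "+": 1, "-": -1, "--": -2, "N/A": None}

MATRIX = {
    "SRC01-E01": ("+", "--", "++"),
    "SRC02-E01": ("++", "--", "++"),
    "SRC03-E01": ("-", "-", "++"),
    "SRC04-E01": ("++", "--", "+"),
    "SRC05-E01": ("+", "--", "++"),
    "SRC06-E01": ("N/A", "-", "++"),
}
HYPOTHESES = ("H1", "H2", "H3")


def count_inconsistencies(matrix):
    """Count, per hypothesis, the rows rated '-' or '--' (N/A is skipped)."""
    counts = {h: 0 for h in HYPOTHESES}
    for ratings in matrix.values():
        for hypothesis, symbol in zip(HYPOTHESES, ratings):
            score = SCALE[symbol]
            if score is not None and score < 0:
                counts[hypothesis] += 1
    return counts


inconsistencies = count_inconsistencies(MATRIX)
print(inconsistencies)  # {'H1': 1, 'H2': 6, 'H3': 0}
```

Under this count, H2 is contradicted by all six rows, H1 by one (SRC03-E01), and H3 by none.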

Diagnosticity Analysis

Most Diagnostic Evidence

Evidence ID	Why Diagnostic
SRC03-E01	The four-cause taxonomy discriminates between H1 (primary cause) and H3 (one factor). It contradicts H1's "primary" framing while confirming RLHF as a factor.
SRC02-E01	The "data, not algorithm" insight discriminates among all three hypotheses. It confirms RLHF's role (contradicting H2), identifies RLHF as an amplifier rather than the root cause (qualifying H1), and points to multi-pronged solutions (supporting H3).

Least Diagnostic Evidence

Evidence ID	Why Non-Diagnostic
SRC04-E01	The GPT-4o incident supports both H1 (RLHF caused it) and H3 (a specific misconfiguration, not inherent to RLHF), so it does not discriminate well between them.
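As a rough numeric companion to the diagnosticity judgments above, one can measure how far apart a row's ratings are for two competing hypotheses. The sketch below computes the absolute gap between each row's H1 and H3 ratings. The numeric scale (++ = 2, + = 1, - = -1, -- = -2) and the names `RATINGS`, `rating_gap`, and `h1_h3_gap` are assumptions for illustration; the original analysis is qualitative and does not define such a score.

```python
# Numeric ratings in the order (H1, H2, H3); None stands for N/A.
# Assumed scale: ++ = 2, + = 1, - = -1, -- = -2.
RATINGS = {
    "SRC01-E01": (1, -2, 2),
    "SRC02-E01": (2, -2, 2),
    "SRC03-E01": (-1, -1, 2),
    "SRC04-E01": (2, -2, 1),
    "SRC05-E01": (1, -2, 2),
    "SRC06-E01": (None, -1, 2),
}
H1, H2, H3 = 0, 1, 2  # column indices


def rating_gap(row, a, b):
    """Absolute difference between two ratings, or None if either is N/A."""
    if row[a] is None or row[b] is None:
        return None
    return abs(row[a] - row[b])


h1_h3_gap = {eid: rating_gap(row, H1, H3) for eid, row in RATINGS.items()}

# Row with the largest H1-versus-H3 gap, ignoring rows with an N/A rating.
scored = {eid: g for eid, g in h1_h3_gap.items() if g is not None}
most_separating = max(scored, key=scored.get)

print(h1_h3_gap)
print(most_separating)  # SRC03-E01
```

Under this assumed scale, SRC03-E01 has the largest H1-versus-H3 gap (3), which is consistent with its listing as most diagnostic. The gap is a crude proxy: it reflects only the matrix symbols and does not capture the qualitative reasoning given for the other rows.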

Outcome

Hypothesis supported: H3 — RLHF is a contributing factor (not the sole cause), and the response is multi-pronged with no dominant strategy.

Hypotheses eliminated: H2 — No evidence supports the claim that sycophancy is not attributed to RLHF. Every source identifies RLHF as at least a contributing factor.

Hypotheses inconclusive: H1 — Partially supported in that RLHF is recognized as significant and there are efforts to address sycophancy. But "primary cause" and "driving change" overstate the evidence. The movement toward RLHF alternatives is primarily driven by computational efficiency, not sycophancy concerns.