Q002 — RLHF and Sycophancy — ACH Matrix
Matrix
| Evidence | H1 | H2 | H3 |
|---|---|---|---|
| SRC01-E01 — RLHF causes sycophancy | ++ | -- | + |
| SRC01-E02 — Universal across SOTA | ++ | -- | + |
| SRC01-E03 — Preference models also sycophantic | ++ | -- | ++ |
| SRC02-E01 — GPT-4o incident | ++ | -- | + |
| SRC02-E02 — OpenAI rollback, prompt fix | + | - | ++ |
| SRC03-E01 — Substantial changes needed | ++ | -- | ++ |
| SRC03-E02 — Covert sycophancy risk | + | -- | ++ |
| SRC04-E01 — Pinpoint tuning works | + | -- | - |
| SRC05-E01 — Attention head analysis | + | -- | - |
| SRC06-E01 — Emergent misalignment | ++ | -- | ++ |
| SRC06-E02 — Three mitigations work | + | -- | + |
| SRC07-E01 — Proxy-oracle gap fundamental | ++ | -- | ++ |
| SRC08-E01 — Fundamental RLHF limits | ++ | -- | ++ |
Legend
| Symbol | Meaning |
|---|---|
| ++ | Strongly consistent |
| + | Consistent |
| -- | Strongly inconsistent |
| - | Inconsistent |
| N/A | Not applicable |
Diagnosticity Analysis
Most diagnostic evidence:
- SRC04-E01 and SRC05-E01 (pinpoint tuning and attention head analysis): These are the most diagnostic because they are the only items rated consistent with H1 (+) but inconsistent with H3 (-), so they discriminate between the two. If surgical fixes work effectively at scale, the response is more adequate than H3 suggests; if they fail at scale, H3 is strengthened.
- SRC01-E03 (preference models also sycophantic): Discriminates between sycophancy being "fixable within RLHF" and being "fundamental to preference-based training."
Least diagnostic evidence:
- SRC06-E02 (three mitigations): Rated consistent (+) with both H1 and H3, so it provides little discrimination between them.
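The H1-versus-H3 comparison above can be checked mechanically. The following is a minimal sketch, not part of the original analysis: it assumes a numeric encoding of the legend symbols (`++` = 2, `+` = 1, `-` = -1, `--` = -2) and takes the absolute score difference between H1 and H3 as a rough diagnosticity measure. The names `MATRIX`, `SCORE`, and `h1_h3_gap` are illustrative only.

```python
# Ratings copied from the matrix above, in column order (H1, H2, H3).
MATRIX = {
    "SRC01-E01": ("++", "--", "+"),
    "SRC01-E02": ("++", "--", "+"),
    "SRC01-E03": ("++", "--", "++"),
    "SRC02-E01": ("++", "--", "+"),
    "SRC02-E02": ("+", "-", "++"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "--", "++"),
    "SRC04-E01": ("+", "--", "-"),
    "SRC05-E01": ("+", "--", "-"),
    "SRC06-E01": ("++", "--", "++"),
    "SRC06-E02": ("+", "--", "+"),
    "SRC07-E01": ("++", "--", "++"),
    "SRC08-E01": ("++", "--", "++"),
}

# ASSUMPTION: this numeric encoding is not defined in the source; it is
# one simple way to make the legend symbols comparable.
SCORE = {"++": 2, "+": 1, "-": -1, "--": -2}


def h1_h3_gap(matrix):
    """Return {evidence_id: |score(H1) - score(H3)|} for every row."""
    return {
        evidence: abs(SCORE[h1] - SCORE[h3])
        for evidence, (h1, _h2, h3) in matrix.items()
    }


gaps = h1_h3_gap(MATRIX)

# List evidence items from most to least discriminating between H1 and H3.
for evidence, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{evidence}: gap {gap}")
```

Under this assumed encoding, SRC04-E01 and SRC05-E01 have the largest gap (2), while SRC06-E02 has a gap of 0. A different encoding, or a measure that also weighs H2, could rank the remaining items differently.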
Outcome
H1 is the best-supported hypothesis: it is consistent or strongly consistent with all 13 evidence items. H2 is effectively eliminated: it is strongly inconsistent with 12 of 13 items and inconsistent with the remaining one (SRC02-E02). H3 adds important nuance to H1: although the problem is recognized and being addressed (H1), the most common responses have been patches rather than structural changes (H3). The surgical interventions (SRC04-E01, SRC05-E01) are the key discriminators: if they scale, H1 is fully vindicated; if they fail, H3 gains strength.
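The counts quoted in the outcome can be reproduced by tallying each column of the matrix. This is a small sketch under the assumption that the ratings are transcribed exactly as in the table; the names `MATRIX`, `HYPOTHESES`, and `tally` are illustrative, not from the source.

```python
from collections import Counter

# Ratings copied from the matrix above, in column order (H1, H2, H3).
MATRIX = {
    "SRC01-E01": ("++", "--", "+"),
    "SRC01-E02": ("++", "--", "+"),
    "SRC01-E03": ("++", "--", "++"),
    "SRC02-E01": ("++", "--", "+"),
    "SRC02-E02": ("+", "-", "++"),
    "SRC03-E01": ("++", "--", "++"),
    "SRC03-E02": ("+", "--", "++"),
    "SRC04-E01": ("+", "--", "-"),
    "SRC05-E01": ("+", "--", "-"),
    "SRC06-E01": ("++", "--", "++"),
    "SRC06-E02": ("+", "--", "+"),
    "SRC07-E01": ("++", "--", "++"),
    "SRC08-E01": ("++", "--", "++"),
}
HYPOTHESES = ("H1", "H2", "H3")


def tally(matrix):
    """Count how often each legend symbol appears under each hypothesis."""
    counts = {h: Counter() for h in HYPOTHESES}
    for ratings in matrix.values():
        for hypothesis, symbol in zip(HYPOTHESES, ratings):
            counts[hypothesis][symbol] += 1
    return counts


counts = tally(MATRIX)
for h in HYPOTHESES:
    consistent = counts[h]["++"] + counts[h]["+"]
    inconsistent = counts[h]["--"] + counts[h]["-"]
    print(f"{h}: consistent={consistent}, inconsistent={inconsistent}, "
          f"strongly inconsistent={counts[h]['--']}")
```

The tally gives H1 no inconsistent items, H2 13 inconsistent items (12 of them strongly), and H3 two inconsistent items (SRC04-E01 and SRC05-E01). This matches the usual ACH practice of ranking hypotheses by how little evidence contradicts them.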