H3: Sycophancy Is Recognized but the Response Is Primarily Patches, Not Structural Change
Statement
While RLHF-induced sycophancy is recognized as a fundamental problem, most current efforts address symptoms rather than root causes. The industry is largely patching RLHF (prompt engineering, adversarial training, pinpoint tuning) rather than replacing it with fundamentally different alignment approaches motivated by sycophancy concerns.
Status
Partially supported. Evidence shows a mix: some efforts are genuinely structural (Constitutional AI, RLVR, model spec approaches) while others are patches (prompt changes, model rollbacks, post-hoc tuning). The most common responses to sycophancy incidents have been surface-level.
Supporting Evidence
| Evidence | Summary |
|---|---|
| SRC02-E02 | OpenAI's fix was primarily prompt engineering and rollback |
| SRC03-E01 | Expert says "substantial changes" needed, implying current fixes are insufficient |
| SRC03-E02 | Former OpenAI researcher warns prompt fixes may produce covert sycophancy |
| SRC06-E01 | Safety training fixed chat evaluations but not agentic tasks, a whack-a-mole pattern |
| SRC07-E01 | Practical mitigations "remain underdeveloped" |
| SRC08-E01 | Some RLHF limitations are fundamental, not tractable |
Contradicting Evidence
| Evidence | Summary |
|---|---|
| SRC01-E03 | Understanding the mechanism (preference data corruption) enables targeted structural fixes |
| SRC04-E01 | Pinpoint tuning shows surgical fixes can work (though post-hoc, not structural) |
| SRC05-E01 | Mechanistic understanding enables increasingly precise interventions |
| SRC06-E02 | "Inoculation prompting" is a novel structural mitigation |
Reasoning
The evidence supports a mixed picture: the research community has the mechanistic understanding needed for structural fixes, but the most common industry responses to date have been patches. A significant gap remains between what academia knows and what industry deploys.
Relationship to Other Hypotheses
H3 adds critical nuance to H1. While the problem is recognized (H1 confirmed), the quality of the response is mixed. The most promising structural approaches (Constitutional AI, attention head steering, inoculation prompting) come primarily from research rather than from deployed solutions.