# SRC02-E02 — OpenAI Rolled Back and Committed to Training Method Changes
## Extract
OpenAI reverted to an earlier model version (gpt-4o-2024-11-20) and committed to "refining its core model training techniques and system prompts to explicitly steer GPT-4o away from sycophancy." The revised system prompt changed from "match the user's vibe, tone, and generally how they are speaking" to "Engage warmly yet honestly with the user. Be direct; avoid ungrounded or sycophantic flattery."
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports — OpenAI working to address RLHF sycophancy | Moderate |
| H2 | Contradicts — active efforts to fix the problem | Strong |
| H3 | Supports — the fix was prompt engineering and rollback, not a fundamental training change | Strong |
## Context
The prompt change is notable because it is a surface-level fix. Steven Adler (former OpenAI safety researcher) warned: "You can tell the model to not be sycophantic, but you might instead teach it 'don't be sycophantic when it'll be obvious.'"
## Notes
The distinction between prompt-level fixes and training-level fixes is critical. Prompt engineering addresses symptoms; training changes address causes.
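The prompt-level nature of the fix can be made concrete with a minimal sketch. Assuming the standard Chat Completions message format (the `build_request` helper and the example user text are illustrative, not from the source), swapping the system prompt changes only the first message of each request; the model weights and its training are untouched:

```python
# Old and new system-prompt wording, as quoted in the extract above.
OLD_SYSTEM_PROMPT = "match the user's vibe, tone, and generally how they are speaking"
NEW_SYSTEM_PROMPT = (
    "Engage warmly yet honestly with the user. Be direct; "
    "avoid ungrounded or sycophantic flattery."
)

def build_request(user_text: str, system_prompt: str) -> dict:
    """Assemble a Chat Completions-style payload (hypothetical helper).

    Swapping system_prompt is the entirety of a prompt-level fix:
    nothing about the underlying model changes.
    """
    return {
        "model": "gpt-4o-2024-11-20",  # the rolled-back snapshot named in the extract
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

before = build_request("Is my business plan good?", OLD_SYSTEM_PROMPT)
after = build_request("Is my business plan good?", NEW_SYSTEM_PROMPT)

# Only the system message differs; the model identifier (and hence the
# training behind it) is identical in both requests.
assert before["model"] == after["model"]
assert before["messages"][0] != after["messages"][0]
assert before["messages"][1] == after["messages"][1]
```

This is why Adler's warning applies equally to both wordings: a training-level change would alter what `"gpt-4o-2024-11-20"` points to, not just the text sent alongside it.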