Skip to content
Research R0040 — RLHF Alternatives
Run 2026-03-29
Query Q002 — RLHF and Sycophancy
Source SRC02
Evidence SRC02-E02

SRC02-E02 — OpenAI Rolled Back and Committed to Training Method Changes

Extract

OpenAI reverted to an earlier model version (gpt-4o-2024-11-20) and committed to "refining its core model training techniques and system prompts to explicitly steer GPT-4o away from sycophancy." The revised system prompt changed from "match the user's vibe, tone, and generally how they are speaking" to "Engage warmly yet honestly with the user. Be direct; avoid ungrounded or sycophantic flattery."

Relevance to Hypotheses

Hypothesis Relationship Strength
H1 Supports — OpenAI working to address RLHF sycophancy Moderate
H2 Contradicts — active efforts to fix the problem Strong
H3 Supports — the fix was prompt engineering and rollback, not a fundamental training change Strong

Context

The prompt change is notable as a surface-level fix. Steven Adler (former OpenAI safety researcher) warned: "You can tell the model to not be sycophantic, but you might instead teach it 'don't be sycophantic when it'll be obvious.'"

Notes

The distinction between prompt-level fixes and training-level fixes is critical. Prompt engineering addresses symptoms; training changes address causes.