SRC07-E01: KTO Uses Binary Signals, Not Preferences
Extract
KTO "directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do." It requires only "a binary signal of whether output is desirable or not" rather than comparative preferences such as "Output A trumps output B." The authors report that KTO "matches or exceeds the performance of preference-based methods at scales from 1B to 30B." Its theoretical foundation is Kahneman and Tversky's prospect theory, specifically the concept of loss aversion.
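To make the "utility of generations" idea concrete, here is a minimal sketch of a KTO-style per-example loss. It assumes the general shape of the objective described in the KTO paper: a sigmoid value function applied to the policy/reference log-ratio, measured against a reference point. The function name, argument names, and default values are hypothetical and chosen for illustration only. In practice the reference point is estimated from a batch, not passed in as a constant.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_style_loss(
    log_ratio: float,       # assumed: log pi_theta(y|x) - log pi_ref(y|x)
    desirable: bool,        # the binary signal: True = thumbs-up, False = thumbs-down
    z_ref: float = 0.0,     # assumed reference point (a constant here, for simplicity)
    beta: float = 0.1,      # hypothetical default: sensitivity to the log-ratio
    lambda_d: float = 1.0,  # hypothetical default: weight on desirable examples
    lambda_u: float = 1.0,  # hypothetical default: weight on undesirable examples
) -> float:
    """Loss for one labeled output; no paired comparison is needed."""
    if desirable:
        # Value rises as the policy makes a desirable output more likely.
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    # Value rises as the policy makes an undesirable output less likely.
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value
```

Each example contributes a loss on its own label, which is why no "A beats B" pair is required. Setting `lambda_u` higher than `lambda_d` is one way an asymmetry in the spirit of loss aversion could be expressed in this sketch.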
Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — KTO is a fundamentally different approach to alignment | Strong |
| H2 | Contradicts — KTO demonstrates viable production alternative | Strong |
| H3 | Supports — KTO targets the data collection problem (binary vs preference) specifically | Moderate |
Context
KTO's use of binary signals rather than preferences is practically significant because thumbs-up/down data is far more abundant than comparative preference data.
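The difference in data requirements can be sketched as two record shapes. The class and field names below are hypothetical, not taken from any particular library. The sketch assumes that a preference pair can be split into one desirable and one undesirable example, while a lone thumbs-up or thumbs-down has no partner and so cannot be turned into a pair without extra data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BinaryExample:
    """One output with a thumbs-up/down label (hypothetical schema)."""
    prompt: str
    completion: str
    desirable: bool


@dataclass(frozen=True)
class PreferencePair:
    """Two outputs for the same prompt, ranked against each other (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


def pair_to_binary(pair: PreferencePair) -> List[BinaryExample]:
    """Split one comparison into two independently labeled examples."""
    return [
        BinaryExample(pair.prompt, pair.chosen, desirable=True),
        BinaryExample(pair.prompt, pair.rejected, desirable=False),
    ]


# Usage: a single comparison yields two binary-labeled records.
pair = PreferencePair("Summarize the memo.", "A concise summary.", "An off-topic reply.")
examples = pair_to_binary(pair)
```

The conversion only runs in one direction in this sketch, which illustrates why binary feedback logs are usable by a KTO-style method but not directly by a pairwise method.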
Notes
Ironically, the OpenAI GPT-4o sycophancy incident involved over-optimization on thumbs-up/down signals, which is exactly the kind of data KTO uses. This suggests that the choice of signal type alone does not prevent sycophancy.