R0040/2026-04-01/Q001/SRC04/E01
KTO applies prospect theory to LLM alignment with binary feedback
URL: https://arxiv.org/abs/2402.01306
Extract
KTO directly maximizes the utility of generations using Kahneman-Tversky prospect theory's human utility model, rather than maximizing preference log-likelihoods like DPO. Key innovation: it learns from only a binary signal of whether an output is desirable -- no preference comparisons required.
Results:

- Matches or exceeds DPO performance at scales from 1B to 30B parameters
- Requires only binary desirability labels (thumbs up/down), not ranked preference pairs
- Dramatically reduces annotation overhead
- Introduces the concept of Human-Aware Losses (HALOs): a family of loss functions that implicitly incorporate human cognitive biases
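
The objective described in this extract can be illustrated with a minimal sketch of a KTO-style loss. It is not the authors' implementation. The function name `kto_loss`, the default values for `beta`, `lambda_d`, and `lambda_u`, and the choice to pass the reference point in as a precomputed number (`kl_estimate`) are all assumptions made for illustration. Each example carries only a binary desirable/undesirable flag, and no preference pairs are involved.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_loss(policy_logps, ref_logps, desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Mean KTO-style loss over a batch of single (unpaired) examples.

    policy_logps / ref_logps: log-probability of each output under the
        policy and the frozen reference model (hypothetical inputs).
    desirable: one boolean per example (thumbs up = True).
    kl_estimate: assumed precomputed reference point; clamped at zero.
    """
    z0 = max(kl_estimate, 0.0)
    losses = []
    for p_logp, r_logp, is_good in zip(policy_logps, ref_logps, desirable):
        # Implied reward: log-ratio of policy to reference.
        reward = p_logp - r_logp
        if is_good:
            # Desirable outputs gain value as reward rises above the reference point.
            value = lambda_d * sigmoid(beta * (reward - z0))
            losses.append(lambda_d - value)
        else:
            # Undesirable outputs gain value as reward falls below it.
            value = lambda_u * sigmoid(beta * (z0 - reward))
            losses.append(lambda_u - value)
    return sum(losses) / len(losses)


# One desirable and one undesirable output, each labeled on its own.
example = kto_loss(
    policy_logps=[-8.0, -12.0],
    ref_logps=[-10.0, -11.0],
    desirable=[True, False],
    kl_estimate=0.0,
)
```

In this sketch, raising the policy's log-probability of a desirable output (or lowering it for an undesirable one) reduces the loss, which is the sense in which the method maximizes utility rather than a preference log-likelihood.
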
Relevance to Hypotheses
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Rationale |
|---|---|---|
| Reward-free preference optimization | Supports | Eliminates both reward model and preference pair requirement |
| Data efficiency | Supports | Binary labels are far cheaper to collect than preference pairs |
| Theoretical grounding | Supports | Prospect theory provides principled foundation for loss design |
Context
KTO is notable for its theoretical novelty -- it is the only major RLHF alternative grounded in behavioral economics. The connection to prospect theory suggests that human cognitive biases should be explicitly modeled in alignment, not treated as noise.
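
To make the cognitive-bias point concrete, here is a minimal sketch of the classic Kahneman-Tversky value function from prospect theory. The parameter values (`alpha = 0.88`, `loss_aversion = 2.25`) are the commonly cited median estimates from Tversky and Kahneman's 1992 work and are used here as assumptions. The function name `prospect_value` is hypothetical. This is the behavioral-economics model, not the KTO loss itself, which is generally described as using a smoother logistic form.

```python
def prospect_value(outcome: float, alpha: float = 0.88,
                   loss_aversion: float = 2.25) -> float:
    """Subjective value of a gain or loss relative to a reference point of 0."""
    if outcome >= 0:
        # Gains: concave, so each extra unit matters a little less.
        return outcome ** alpha
    # Losses: same curvature, but scaled up by the loss-aversion factor.
    return -loss_aversion * ((-outcome) ** alpha)


gain = prospect_value(100.0)
loss = prospect_value(-100.0)
# With these parameters, a loss weighs 2.25 times as much as an equal gain.
ratio = abs(loss) / gain
```

The asymmetry between `gain` and `loss` is the kind of systematic bias the note refers to: it is a stable feature of how people value outcomes, which is why it can be built into a loss function rather than averaged away.
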