R0057/2026-04-01/C006/SRC01/E01
All six named alternatives (DPO, KTO, GRPO, Constitutional AI, ORPO, RLVR) are documented in the literature
URL: https://cbtw.tech/insights/rlhf-alternatives-post-training-optimization
Extract
All six alternatives are documented across multiple technical surveys. DPO eliminates the reward model. KTO uses binary feedback. GRPO uses group-relative advantages. Constitutional AI uses principle-based feedback. ORPO combines SFT and preference optimization. RLVR uses programmatic verifiers.
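To make two of the items above concrete, the sketch below shows a per-pair DPO loss (the policy-vs-reference log-probability margin stands in for a learned reward model) and GRPO's group-relative advantage normalization. This is a minimal illustration, not code from the cited article; the function names, the beta default, and the use of whole-sequence log-probabilities are assumptions made for the example.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: the margin between policy and reference
    log-probabilities replaces an explicitly trained reward model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), i.e. softplus(-margin)
    return math.log1p(math.exp(-margin))


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: each sampled completion is
    scored against the mean and std of its own group, with no critic."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative values only.
print(dpo_pair_loss(-12.3, -15.1, -12.8, -14.9, beta=0.1))
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```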
Relevance to Hypotheses
| Hypothesis | Relationship | Rationale |
|---|---|---|
| H1 | Supports | Directly addresses the accuracy of the claim |
| H2 | Supports | Consistent with partial correctness |
| H3 | Contradicts | Evidence weighs against material inaccuracy in the claim |
Context
All six methods are well established in the ML literature, with multiple implementations and adoption by major labs.