R0057/2026-04-01/C006/SRC01/E01

Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C006
Source SRC01
Evidence SRC01-E01
Type Factual

All six named alternatives (DPO, KTO, GRPO, Constitutional AI, ORPO, RLVR) are documented in the literature

URL: https://cbtw.tech/insights/rlhf-alternatives-post-training-optimization

Extract

All six alternatives are documented across multiple technical surveys. DPO removes the separately trained reward model by optimizing directly on preference pairs. KTO learns from binary desirable/undesirable feedback rather than paired comparisons. GRPO replaces the learned value model with group-relative advantage estimates. Constitutional AI substitutes principle-guided AI feedback for human labels. ORPO folds preference optimization into the SFT objective in a single loss. RLVR replaces learned reward models with programmatic verifiers.
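Two of the extracted claims can be sketched concretely. The snippet below is a minimal illustration, not any lab's implementation: `dpo_loss` shows how DPO scores a preference pair with no separate reward model (the implicit reward is beta times the policy/reference log-probability ratio), and `grpo_advantages` shows GRPO's group-relative advantage, which normalizes each sampled response's reward against its own group instead of querying a value model. All function and parameter names here are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (sketch). The implicit reward of a response
    is beta * log(pi_policy / pi_ref); the loss is the negative
    log-sigmoid of the chosen-minus-rejected reward margin."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages (sketch): standardize
    each reward against the mean and std of its sampling group,
    replacing a learned critic/value model."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

With equal chosen/rejected margins the DPO loss is log 2 (the sigmoid sits at 0.5), and a uniform-reward group yields all-zero GRPO advantages, which matches the intuition that neither signal favors any update.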

Relevance to Hypotheses

Hypothesis | Relationship | Strength
H1 | Supports | Directly addresses claim accuracy
H2 | Supports | Allows for partial correctness
H3 | Contradicts | Evidence contradicts material inaccuracy

Context

All six methods are well-established in the ML literature, with multiple open implementations and adoption by major labs.