

Research R0057 — RLHF Yes-Men Claims v3
Run 2026-04-01
Claim C006
Hypothesis H1

Statement

All six named alternatives to RLHF exist and qualify as major alternatives.

Status

Current: Supported

Supporting Evidence

Evidence Summary
SRC01-E01 All six named alternatives (DPO, KTO, GRPO, Constitutional AI, ORPO, RLVR) are documented in the literature

Contradicting Evidence

Evidence Summary
No contradicting evidence found

Reasoning

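All six alternatives are documented across multiple technical surveys. DPO eliminates the separate reward model by optimizing the policy directly on preference pairs. KTO learns from binary (desirable/undesirable) feedback rather than paired preferences. GRPO replaces the learned value function with advantages computed relative to a group of completions sampled for the same prompt. Constitutional AI uses AI-generated feedback guided by a written set of principles (a "constitution") instead of per-example human labels. ORPO combines supervised fine-tuning and preference optimization in a single training stage, without a reference model. RLVR replaces the learned reward model with programmatic verifiers, such as answer checkers or unit tests.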
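To make a few of these mechanisms concrete, here is a minimal Python sketch of the per-example objectives for DPO, ORPO, and GRPO, plus a toy RLVR-style verifier. It assumes sequence-level (or per-token-averaged) log-probabilities have already been computed by some model; the function names, the default beta and lambda values, and the "Answer:" output format are illustrative assumptions, not taken from the surveyed sources or from any specific library. KTO and Constitutional AI are not sketched here.

```python
import math
from statistics import mean, pstdev


def softplus(x):
    # Numerically stable log(1 + exp(x)); note -log(sigmoid(m)) == softplus(-m).
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (inputs are sequence log-probs).

    No reward model: the implicit reward is beta * (policy logp - reference logp).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return softplus(-margin)  # -log(sigmoid(margin))


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """ORPO loss: SFT negative log-likelihood plus an odds-ratio penalty.

    Inputs are per-token-averaged log-probs (must be < 0). No reference model.
    """
    def log_odds(avg_logp):
        # log(p / (1 - p)) with p = exp(avg_logp)
        return avg_logp - math.log1p(-math.exp(avg_logp))

    sft_nll = -avg_logp_chosen
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return sft_nll + lam * softplus(-ratio)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each completion's reward against
    the mean and (population) std of its own group; no learned value model.
    Implementations differ on details such as sample vs. population std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def verifier_reward(completion, reference_answer, marker="Answer:"):
    """Toy RLVR-style verifier (hypothetical format): reward 1.0 if the text
    after the last marker exactly matches the reference answer, else 0.0."""
    if marker not in completion:
        return 0.0
    answer = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0


if __name__ == "__main__":
    # Verifier rewards for a group of sampled completions feed GRPO advantages.
    group = ["Answer: 42", "Answer: 41", "no final answer", "... Answer: 42"]
    rewards = [verifier_reward(c, "42") for c in group]
    print(rewards, grpo_advantages(rewards))
```

Pairing the verifier with `grpo_advantages` mirrors how RLVR-style rewards are often combined with group-relative methods: completions that pass the check receive positive advantages and failures receive negative ones, without any learned reward or value model.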

Relationship to Other Hypotheses

H1 holds that the claim is fully accurate. H2 allows for partial correctness. H3 is eliminated by the evidence.