R0040/2026-04-01/Q001/SRC06/E01

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Source SRC06
Evidence SRC06-E01
Type Factual

Constitutional AI replaces human preference annotation with AI self-critique under principles

URL: https://arxiv.org/abs/2212.08073

Extract

Constitutional AI trains a harmless AI assistant through self-improvement without human labels for harmful outputs. Human oversight is provided only through a set of principles (the "constitution").

Two-stage process:

1. Supervised phase: the model generates self-critiques and revisions based on constitutional principles, then is finetuned on the revised responses.
2. RL phase (RLAIF): another model evaluates pairs of samples against the constitution. A preference model is trained from these AI preferences, and the policy is optimized via RL using this preference model as the reward.
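The two stages above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: all function and variable names are hypothetical, and the model calls are stubbed out where a real pipeline would query an LLM.

```python
# Hypothetical sketch of the Constitutional AI loop. Model calls are
# stubbed; a real system would prompt an LLM at each step.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def critique(response: str, principle: str) -> str:
    # Stub: a real system asks the model to critique `response`
    # against `principle`.
    return f"Critique of {response!r} under: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stub: a real system asks the model to rewrite the response
    # so the critique no longer applies.
    return response + " [revised]"

def supervised_phase(response: str, principles: list[str]) -> str:
    """Stage 1: iterated self-critique and revision; the revised
    outputs become supervised finetuning data."""
    for principle in principles:
        response = revise(response, critique(response, principle))
    return response

def ai_preference(sample_a: str, sample_b: str) -> int:
    """Stage 2 (RLAIF): an AI labeler picks which sample better
    follows the constitution (0 = A, 1 = B). Stub heuristic here;
    the resulting labels train a preference model used as the
    RL reward."""
    return 0 if len(sample_a) >= len(sample_b) else 1

revised = supervised_phase("draft answer", CONSTITUTION)
label = ai_preference(revised, "draft answer")
print(revised, label)
```

The point of the sketch is the data flow: no human labels appear anywhere; human oversight enters only through the principle strings in the constitution.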

Key results:

- RLAIF approaches are significantly more harmless than RLHF while maintaining parity on helpfulness
- Cost per preference judgment drops from $1+ (human) to <$0.01 (AI)
- Anthropic uses this method in production for Claude training
- 2026 update: Anthropic is moving from rule-following toward teaching models why principles matter, for better generalization

Relevance to Hypotheses

Open-ended query -- maps to thematic clusters:

Cluster               | Relationship | Strength
----------------------|--------------|----------------------------------------
AI-generated feedback | Supports     | Primary evidence for RLAIF/CAI paradigm
Cost reduction        | Supports     | 100x+ reduction in annotation costs
Production deployment | Supports     | Used in Claude production training

Context

CAI is notable as the earliest and most established RLHF alternative in production. It modifies rather than replaces the RL optimization step -- the key change is swapping the source of feedback from humans to AI. This means CAI retains the RL pipeline's complexity but addresses its scalability bottleneck: preference labels, not RL training itself.