R0040/2026-04-01/Q001/SRC06/E01¶
Constitutional AI replaces human preference annotation with AI self-critique under principles
URL: https://arxiv.org/abs/2212.08073
Extract¶
Constitutional AI trains a harmless AI assistant through self-improvement without human labels for harmful outputs. Human oversight is provided only through a set of principles (the "constitution").
Two-stage process:
1. Supervised phase: the model generates self-critiques and revisions based on constitutional principles, then is finetuned on the revised responses.
2. RL phase (RLAIF): a feedback model evaluates pairs of samples against the constitution; a preference model is trained from these AI preferences, and the policy is optimized via RL using the preference model as reward.
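The supervised phase above can be sketched as a critique-revision loop. This is an illustrative sketch, not the paper's actual code: `model` stands in for any text-generation callable, and the prompt templates and function names are assumptions.

```python
import random

def critique_revision_loop(model, prompt, principles, n_rounds=2, seed=0):
    """Stage 1 sketch: generate an initial response, then alternate
    critique and revision against randomly drawn constitutional
    principles. `model` is any `str -> str` generation callable."""
    rng = random.Random(seed)
    response = model(f"Respond to: {prompt}")
    for _ in range(n_rounds):
        principle = rng.choice(principles)
        # Ask the model to critique its own response against the principle.
        critique = model(
            f"Identify how this response conflicts with the principle "
            f"'{principle}':\n{response}"
        )
        # Ask the model to revise the response in light of the critique.
        response = model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # The (prompt, final revision) pair becomes a finetuning example.
    return prompt, response
```

Each round costs two extra generation calls (critique plus revision); the finetuning set is built only from the final revisions, so the critiques themselves are discarded.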
Key results:
- RLAIF-trained models are significantly more harmless than RLHF-trained models while maintaining helpfulness parity
- Cost per preference judgment drops from $1+ (human) to <$0.01 (AI)
- Anthropic uses this method in production for Claude training
- 2026 update: Anthropic is moving from rule-following to teaching models why principles matter, for better generalization
Relevance to Hypotheses¶
Open-ended query -- maps to thematic clusters:
| Cluster | Relationship | Strength |
|---|---|---|
| AI-generated feedback | Supports | Primary evidence for RLAIF/CAI paradigm |
| Cost reduction | Supports | 100x+ reduction in annotation costs |
| Production deployment | Supports | Used in Claude production training |
Context¶
CAI is notable as the earliest and most established RLHF alternative in production. It modifies rather than replaces the RL optimization step -- the key change is swapping the source of preference feedback from humans to an AI model. CAI therefore retains the RL pipeline's complexity but addresses its main scalability bottleneck, human annotation.
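The feedback swap described above can be illustrated with a minimal RLAIF labeling sketch. This is an assumption-laden illustration, not the paper's implementation: `feedback_model` is a hypothetical `str -> str` callable, and the prompt wording is invented.

```python
def ai_preference_label(feedback_model, prompt, sample_a, sample_b, principle):
    """Stage 2 sketch: the feedback model judges which of two policy
    samples better satisfies a constitutional principle.
    Returns 0 if sample A is preferred, 1 if sample B is."""
    query = (
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {sample_a}\n"
        f"(B) {sample_b}\n"
        f"Which response better follows the principle? Answer A or B."
    )
    answer = feedback_model(query).strip().upper()
    return 0 if answer.startswith("A") else 1
```

The collected (prompt, chosen, rejected) triples then train a preference model exactly as in RLHF; only the labeler changed, which is why the rest of the RL pipeline is untouched.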