# SRC03-E01 — Constitutional AI Replaces Human Feedback with Principles

## Extract
Constitutional AI trains "a harmless AI assistant through self-improvement without human labels for harmful outputs." The approach uses "a list of rules or principles" (a "constitution") and involves two phases: supervised learning with AI self-critique and revision, followed by "RL from AI Feedback" (RLAIF) where "an AI model evaluates response quality" instead of human annotators. The method "creates more harmless models with minimal impact on helpfulness."
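The supervised phase described in the extract (AI self-critique followed by revision against a constitutional principle) can be sketched as a loop. This is a minimal illustration, not Anthropic's implementation: `generate` stands in for a hypothetical LLM call and is stubbed with canned strings so the control flow is runnable, and the principles and prompt templates are placeholders.

```python
# Sketch of Constitutional AI's supervised phase: draft, critique against a
# principle, then revise. `generate` is a hypothetical model call, stubbed
# here with canned strings so the loop structure is runnable.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    # Stand-in for a real model call; keyed off the prompt template.
    if "Rewrite the response" in prompt:
        return "Here is a revised, more careful response."
    if "Critique the response" in prompt:
        return "The response could be more careful."
    return "Initial draft response."

def critique_and_revise(user_prompt: str) -> str:
    """One critique/revision pass per constitutional principle."""
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Response: {response}\nCritique:"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nRewrite:"
        )
    return response
```

In the paper's recipe, the revised responses then become supervised fine-tuning targets before the RLAIF phase begins.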
## Relevance to Hypotheses
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Strongly supports — CAI is a deployed alternative to RLHF | Strong |
| H2 | Contradicts — CAI is in production at Anthropic | Strong |
| H3 | Supports — CAI partly replaces and partly augments RLHF | Moderate |
## Context
Constitutional AI is notable for being the first major alternative that changed the feedback source (from human to AI) rather than just the optimization algorithm. It is deployed in production in Anthropic's Claude models.
## Notes
CAI still uses an RL training loop; the innovation lies in the feedback source rather than in eliminating RL. In that sense it is more accurately described as RLAIF, swapping AI preference labels for human ones, than as a replacement for RL-based fine-tuning itself.
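The AI-feedback step that distinguishes RLAIF from RLHF can be sketched as a preference-labeling pass. Everything below is illustrative: a real system prompts a feedback model with a constitutional principle to compare two responses, whereas `ai_judge` here is a toy stub heuristic that merely prefers responses without obviously harmful phrasing.

```python
# Toy sketch of the RLAIF labeling step: an AI judge ranks response pairs,
# producing the preference data that would train a reward model for the RL
# loop. `ai_judge` is a stub heuristic, not a real feedback model.

HARMFUL_MARKERS = ("here's how to", "step-by-step instructions")

def ai_judge(principle: str, response_a: str, response_b: str) -> int:
    """Return 0 if response_a is preferred under the principle, else 1."""
    def score(r: str) -> int:
        return -sum(marker in r.lower() for marker in HARMFUL_MARKERS)
    return 0 if score(response_a) >= score(response_b) else 1

def label_preferences(prompts, sampler, principle):
    """Build (chosen, rejected) pairs for reward-model training."""
    pairs = []
    for prompt in prompts:
        a, b = sampler(prompt), sampler(prompt)  # two samples per prompt
        if ai_judge(principle, a, b) == 0:
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting pairs play the same role human comparison data plays in RLHF, which is why the RL machinery downstream of the labels can stay unchanged.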