R0040/2026-03-28/Q001/SRC03/E01¶
Constitutional AI replaces human preference labeling with AI self-critique against principles.
URL: https://arxiv.org/abs/2212.08073
Extract¶
Constitutional AI (CAI) operates through two phases:
- Supervised Learning (Critique) Phase: A model generates responses, then iteratively critiques and revises them against a set of human-written constitutional principles addressing truthfulness, safety, and helpfulness.
- RLAIF Phase: An LLM-as-a-judge compares pairs of completions, using the constitutional principles as context, and selects the output that better aligns with the stated values. This AI-generated preference data trains a preference model, which in turn trains the policy via RL.
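The two phases above can be sketched with a toy stand-in for the model. Everything here is illustrative: `toy_model` is a hypothetical placeholder for a real LLM API call, and the two-principle `CONSTITUTION` is a made-up miniature, not Anthropic's actual text.

```python
# Hedged sketch of the two CAI phases, assuming a hypothetical
# `toy_model` stand-in for a real LLM completion call.

CONSTITUTION = [
    "Choose the response that is most truthful.",
    "Choose the response that is least harmful.",
]

def toy_model(prompt: str) -> str:
    # Placeholder for an LLM call; a real system would query a model API.
    return f"[completion for: {prompt[:40]}...]"

def critique_phase(prompt: str, n_rounds: int = 2) -> str:
    """Phase 1: generate a response, then iteratively critique and
    revise it against each constitutional principle."""
    response = toy_model(prompt)
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            critique = toy_model(
                f"Critique this response against '{principle}':\n{response}")
            response = toy_model(
                f"Revise to address the critique:\n{critique}\n"
                f"Original:\n{response}")
    return response

def rlaif_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Phase 2: an LLM judge, given the constitution as context,
    picks the better completion, yielding a synthetic preference."""
    verdict = toy_model(
        f"Given the principles {CONSTITUTION}, which response to "
        f"'{prompt}' is better?\nA: {resp_a}\nB: {resp_b}\nAnswer A or B.")
    return resp_a if "A" in verdict else resp_b  # preferred completion
```

The revised responses from phase 1 fine-tune the model via supervised learning; the preference pairs from phase 2 feed the preference-model and RL stage described above.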
CAI is described as "the earliest documented, large-scale use of synthetic data for RLHF training" and is credited with having "kickstarted the broader field of RLAIF."
Anthropic has deployed CAI for all Claude models. The constitution grew from 2,700 words in 2023 to 23,000 words in 2026, with Claude itself using the constitution to construct synthetic training data.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | CAI is a distinct alternative with documented production deployment at scale |
| H2 | Contradicts | Anthropic has successfully used CAI as the primary alignment method for years |
| H3 | Supports | CAI retains the RL optimization loop; it changes the feedback source, not the optimization paradigm |
Context¶
CAI's key innovation is replacing expensive, noisy human annotation with principle-guided AI feedback. This changes WHO provides the feedback but not HOW it is used: the AI-generated preferences still train a preference model, whose scores still drive RL optimization of the policy. The 23,000-word 2026 constitution reflects significant ongoing investment in this approach.
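The preference-model step is the same regardless of whether the labels come from humans or an AI judge. A minimal sketch, assuming a toy setup where each completion is summarized by a single scalar feature and the preference model is one learned weight trained with the standard Bradley-Terry loss `-log σ(r(chosen) - r(rejected))`; the features, learning rate, and data are all illustrative:

```python
import math
import random

random.seed(0)

# Synthetic preference pairs: (chosen_feature, rejected_feature).
# By construction the chosen completion has the higher feature score.
pairs = [(random.uniform(0.5, 1.0), random.uniform(0.0, 0.5))
         for _ in range(200)]

w = 0.0   # the one-parameter "preference model": r(x) = w * x
lr = 0.5  # illustrative learning rate

for chosen, rejected in pairs * 20:
    margin = w * (chosen - rejected)
    sigma = 1.0 / (1.0 + math.exp(-margin))
    # Gradient of -log(sigma(margin)) with respect to w.
    grad = -(1.0 - sigma) * (chosen - rejected)
    w -= lr * grad

# After training, the model assigns higher reward to the preferred
# (higher-feature) completion, so it can serve as an RL reward signal.
```

In a real RLAIF pipeline the scalar feature would be replaced by a neural reward model over full completions, but the loss and the role of the trained model in the RL loop are the same.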