R0040/2026-03-28/Q001/SRC03/E01¶
Constitutional AI replaces human preference labeling with AI self-critique against principles.
URL: https://arxiv.org/abs/2212.08073
Extract¶
Constitutional AI (CAI) operates through two phases:
- Supervised Learning (Critique) Phase: A model generates responses, then iteratively critiques and revises them against a set of human-written constitutional principles addressing truthfulness, safety, and helpfulness.
- RLAIF Phase: An LLM-as-a-judge compares pairs of completions, using the constitutional principles as context, and selects the output that better aligns with the stated values. This AI-generated preference data trains a preference model, which in turn trains the policy via RL.
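The two phases above can be sketched with a toy stand-in for the model. Everything here is illustrative: `toy_model` is a hypothetical placeholder for a real LLM API call, and the two-principle `CONSTITUTION` is a made-up miniature, not Anthropic's actual text.

```python
# Hedged sketch of the two CAI phases, assuming a hypothetical
# `toy_model` stand-in for a real LLM completion call.

CONSTITUTION = [
    "Choose the response that is most truthful.",
    "Choose the response that is least harmful.",
]

def toy_model(prompt: str) -> str:
    # Placeholder for an LLM call; a real system would query a model API.
    return f"[completion for: {prompt[:40]}...]"

def critique_phase(prompt: str, n_rounds: int = 2) -> str:
    """Phase 1: generate a response, then iteratively critique and
    revise it against each constitutional principle."""
    response = toy_model(prompt)
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            critique = toy_model(
                f"Critique this response against '{principle}':\n{response}")
            response = toy_model(
                f"Revise to address the critique:\n{critique}\n"
                f"Original:\n{response}")
    return response

def rlaif_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Phase 2: an LLM judge, given the constitution as context,
    picks the better completion, yielding a synthetic preference."""
    verdict = toy_model(
        f"Given the principles {CONSTITUTION}, which response to "
        f"'{prompt}' is better?\nA: {resp_a}\nB: {resp_b}\nAnswer A or B.")
    return resp_a if "A" in verdict else resp_b  # preferred completion
```

The revised responses from phase 1 fine-tune the model via supervised learning; the preference pairs from phase 2 feed the preference-model and RL stage described above.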
CAI is described as "the earliest documented, large-scale use of synthetic data for RLHF training" and is credited with having "kickstarted the broader field of RLAIF."
Anthropic has deployed CAI for all Claude models. The constitution grew from 2,700 words in 2023 to 23,000 words in 2026, with Claude itself using the constitution to construct synthetic training data.
Relevance to Hypotheses¶
| Hypothesis | Relationship | Strength |
|---|---|---|
| H1 | Supports | CAI is a distinct alternative with documented production deployment at scale |
| H2 | Contradicts | Anthropic has successfully used CAI as the primary alignment method for years |
| H3 | Supports | CAI retains the RL optimization loop; it changes the feedback source, not the optimization paradigm |
Context¶
CAI's key innovation is replacing expensive, noisy human annotation with principle-guided AI feedback. This changes WHO provides the feedback but not HOW it is used: the AI-generated preferences still train a preference model, whose scores still drive RL optimization of the policy. The 23,000-word 2026 constitution reflects significant ongoing investment in this approach.
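The preference-model step is the same regardless of whether the labels come from humans or an AI judge. A minimal sketch, assuming a toy setup where each completion is summarized by a single scalar feature and the preference model is one learned weight trained with the standard Bradley-Terry loss `-log σ(r(chosen) - r(rejected))`; the features, learning rate, and data are all illustrative:

```python
import math
import random

random.seed(0)

# Synthetic preference pairs: (chosen_feature, rejected_feature).
# By construction the chosen completion has the higher feature score.
pairs = [(random.uniform(0.5, 1.0), random.uniform(0.0, 0.5))
         for _ in range(200)]

w = 0.0   # the one-parameter "preference model": r(x) = w * x
lr = 0.5  # illustrative learning rate

for chosen, rejected in pairs * 20:
    margin = w * (chosen - rejected)
    sigma = 1.0 / (1.0 + math.exp(-margin))
    # Gradient of -log(sigma(margin)) with respect to w.
    grad = -(1.0 - sigma) * (chosen - rejected)
    w -= lr * grad

# After training, the model assigns higher reward to the preferred
# (higher-feature) completion, so it can serve as an RL reward signal.
```

In a real RLAIF pipeline the scalar feature would be replaced by a neural reward model over full completions, but the loss and the role of the trained model in the RL loop are the same.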