R0040/2026-04-01/Q001/SRC06/E01

Research R0040 — RLHF Alternatives
Run 2026-04-01
Query Q001
Source SRC06
Evidence SRC06-E01
Type Factual

Constitutional AI replaces human preference annotation with AI self-critique under principles

URL: https://arxiv.org/abs/2212.08073

Extract

Constitutional AI trains a harmless AI assistant through self-improvement without human labels for harmful outputs. Human oversight is provided only through a set of principles (the "constitution").

Two-stage process:

1. Supervised phase: the model generates self-critiques and revisions based on constitutional principles, then is finetuned on the revised responses.
2. RL phase (RLAIF): another model evaluates pairs of samples against the constitution. A preference model is trained from these AI preferences, and the policy is optimized via RL using this preference model as the reward.
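The two stages above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: all function and variable names are hypothetical, and the model calls are stubbed out where a real pipeline would query an LLM.

```python
# Hypothetical sketch of the Constitutional AI loop. Model calls are
# stubbed; a real system would prompt an LLM at each step.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def critique(response: str, principle: str) -> str:
    # Stub: a real system asks the model to critique `response`
    # against `principle`.
    return f"Critique of {response!r} under: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stub: a real system asks the model to rewrite the response
    # so the critique no longer applies.
    return response + " [revised]"

def supervised_phase(response: str, principles: list[str]) -> str:
    """Stage 1: iterated self-critique and revision; the revised
    outputs become supervised finetuning data."""
    for principle in principles:
        response = revise(response, critique(response, principle))
    return response

def ai_preference(sample_a: str, sample_b: str) -> int:
    """Stage 2 (RLAIF): an AI labeler picks which sample better
    follows the constitution (0 = A, 1 = B). Stub heuristic here;
    the resulting labels train a preference model used as the
    RL reward."""
    return 0 if len(sample_a) >= len(sample_b) else 1

revised = supervised_phase("draft answer", CONSTITUTION)
label = ai_preference(revised, "draft answer")
print(revised, label)
```

The point of the sketch is the data flow: no human labels appear anywhere; human oversight enters only through the principle strings in the constitution.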

Key results:

- RLAIF approaches are significantly more harmless than RLHF while maintaining parity on helpfulness
- Cost per preference judgment drops from $1+ (human) to <$0.01 (AI)
- Anthropic uses this method in production for Claude training
- 2026 update: Anthropic is moving from rule-following toward teaching models why principles matter, for better generalization

Relevance to Hypotheses

Open-ended query -- maps to thematic clusters:

Cluster               | Relationship | Strength
----------------------|--------------|----------------------------------------
AI-generated feedback | Supports     | Primary evidence for RLAIF/CAI paradigm
Cost reduction        | Supports     | 100x+ reduction in annotation costs
Production deployment | Supports     | Used in Claude production training

Context

CAI is notable as the earliest and most established RLHF alternative in production. It modifies rather than replaces the RL optimization step -- the key change is swapping the source of feedback from humans to AI. This means CAI retains the RL pipeline's complexity but addresses its scalability bottleneck: preference labels, not RL training itself.