Constitutional AI replaces thousands of human safety labels with ~75 written principles that the model uses to critique and rewrite its own answers. That's the short answer. The longer answer is more interesting.
When Anthropic published the technique in December 2022, the pitch was almost boring: write down the rules, then have the AI grade itself against them. But that one move โ taking humans out of the harm-labeling loop โ is why Claude can be trained at a scale and consistency that an army of annotators could never match. As a founder who has watched a dozen AI startups try to bolt safety on after the fact, this is the rare case where the safety method is also the cost structure.
Anthropic Constitutional AI explained: what it actually is
Constitutional AI is a training method, introduced by Anthropic in its December 2022 paper "Constitutional AI: Harmlessness from AI Feedback," that aligns a language model using a short written set of principles โ roughly 75 in Anthropic's published constitution โ instead of large volumes of human-labeled examples of harmful content. The model reads the constitution, critiques its own draft responses against it, rewrites them, and is then trained on the improved versions. Humans never label the harmful outputs directly.
That is the whole trick. Conventional safety tuning leans on RLHF โ reinforcement learning from human feedback โ where people rank thousands of model outputs by hand. Constitutional AI swaps that for RLAIF: reinforcement learning from AI feedback. A model judges the outputs using the written rules, so the number of human harm labels drops toward zero while the volume of training signal goes up.
How Constitutional AI works: the two phases
The method runs in two distinct phases. The first is supervised; the second is reinforcement learning. Both lean on the same constitution.
Phase 1 โ Supervised (self-critique & revise)
A helpful-but-unaligned model is prompted with potentially harmful questions. It generates a response, then is asked to critique that response against a randomly drawn constitutional principle and rewrite it. The revised, less-harmful answers become the supervised fine-tuning dataset. No human labels the harm.
Phase 2 โ RLAIF (AI preference model)
The fine-tuned model generates pairs of responses. A separate AI judge picks which one better follows the constitution, producing a preference dataset. That dataset trains a preference model, which then drives standard reinforcement learning โ the same RL machinery as RLHF, but with AI-generated harm preferences instead of human ones.
Anthropic still uses human feedback for one thing: helpfulness. The split is the point โ humans judge whether Claude is useful, the constitution judges whether it is harmful. That division of labor is what the original paper's subtitle, "Harmlessness from AI Feedback," is pointing at.
RLHF vs RLAIF vs Constitutional AI: a side-by-side
The cleanest way to understand why Anthropic went this direction is to put the three approaches next to each other across the dimensions that actually matter for cost and consistency.
| Attribute | Standard RLHF | Constitutional AI (RLAIF) |
|---|---|---|
| Harm labeling | Tens of thousands of human comparisons | ~0 human harm labels; AI judges vs constitution |
| Rules location | Implicit, inside labelers' heads | Explicit, ~75 written principles |
| Cost driver | Human annotator hours (slow, expensive) | Compute for self-critique (cheap, scalable) |
| Consistency | Varies by labeler and mood | Same constitution applied every time |
| Transparency | Hard to audit why a model refused | Principles are published and inspectable |
| Auditability of changes | Re-label thousands of examples | Edit the text of a principle |
| Year introduced | 2017 (RLHF), 2022 (InstructGPT-style) | December 2022 (Anthropic) |
The economic story is the one founders should internalize. RLHF's cost scales with human hours. Constitutional AI's cost scales with compute, which is falling ~10x every couple of years. When your safety budget rides the same curve as your training budget, you can afford to be more thorough, not less.
What is actually in the constitution
Anthropic published the full list. The ~75 principles are not invented from scratch โ they are stitched together from existing, recognized sources, which is part of why the approach is defensible rather than arbitrary.
UN Universal Declaration of Human Rights (1948)
Core principles on dignity, freedom, and non-discrimination
Apple's Terms of Service
Practical guardrails around privacy and offensive content
DeepMind's Sparrow rules
Principles from a peer lab's dialogue-safety work
Anthropic's own research
Principles encouraging non-Western perspectives and avoiding preachiness
Sample principles read like plain instructions: "Choose the response that is least likely to be viewed as harmful or offensive to a non-Western audience," or "Choose the response that is least threatening or aggressive." One principle even tells the model to avoid sounding too preachy or condescending โ a direct fix for the lecturing tone that early safety tuning produced. In 2023, Anthropic ran Collective Constitutional AI, crowdsourcing principles from roughly 1,000 Americans to test whether a publicly-sourced constitution behaves differently from the in-house one.
What Anthropic's Constitutional AI safety approach does โ and doesn't โ fix
This is where I'd push back on the marketing. Constitutional AI is a genuine advance, but it is one layer in a stack, not a solved-alignment claim. The 2022 paper's real result was a better tradeoff curve โ models that were both more harmless and more helpful than the RLHF baseline, and that explained their refusals instead of dodging.
What it improves
- โ Near-zero human labeling of harmful content
- โ Transparent, editable, published rules
- โ Refusals that explain themselves vs. stonewalling
- โ A better harmlessness-vs-helpfulness frontier
What it does not solve
- โ Jailbreaks and adversarial prompts still work
- โ Hallucination and factual errors are untouched
- โ Whose values? The constitution is a value choice
- โ A model judging itself can share its own blind spots
That last point is the deepest one. If the model evaluating responses has the same blind spots as the model generating them, RLAIF can reinforce a shared error. Anthropic knows this, which is why Claude ships with red-teaming, a usage policy, output classifiers, and the broader Responsible Scaling Policy on top of the constitution. The constitution is the foundation, not the whole building. For founders tracking how this plays into model pricing and capability tiers, our AI Valuations dashboard and AI Landscape track where the labs are spending.
Why this matters for founders and operators
If you're building on top of Claude or any frontier model, the lesson isn't "copy the constitution." It's that explicit, written rules scale and audit better than implicit ones baked into labeled data. When a regulator, an enterprise buyer, or your own board asks "why did your AI refuse this?", an answer you can point to in a text file beats "our labelers felt that way."
The same logic applies to your own product guardrails. Across the 65+ companies I've invested in, the AI products that survive enterprise procurement are the ones that can show their rules, not just assert their safety. Constitutional AI is the highest-profile proof that you can make alignment cheaper and more legible at the same time โ and that combination is rare enough to study.
Constitutional AI didn't solve alignment.
It proved that safety can scale with compute instead of headcount โ and that the rules should be written down where anyone can read them.
Track AI model economics and lab spending on the AI Landscape Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.