What is Anthropic's Constitutional AI in simple terms?

Constitutional AI is a training method in which an AI model is given a short written set of rules (Anthropic's published constitution runs to roughly 75 principles) and learns to critique and rewrite its own answers to follow them. Instead of thousands of humans labeling harmful outputs one by one, the model supervises itself against the written constitution. The reinforcement-learning stage of the method is therefore called RLAIF (reinforcement learning from AI feedback) rather than RLHF (reinforcement learning from human feedback).

What is the difference between RLHF and RLAIF?

RLHF (reinforcement learning from human feedback) uses people to rank model outputs, which is slow and expensive — large labs have paid for tens of thousands of human comparison labels. RLAIF (reinforcement learning from AI feedback), the core of Constitutional AI, replaces the human harmlessness labels with an AI model that judges responses against ~75 written principles. Anthropic still uses humans for helpfulness, but the harm-labeling step is automated.

What is actually in Anthropic's constitution?

Anthropic's published constitution draws on sources including the 1948 UN Universal Declaration of Human Rights, Apple's Terms of Service, DeepMind's Sparrow rules, and Anthropic's own research. The roughly 75 principles ask the model to avoid harmful, unethical, racist, or illegal content and to prefer responses that are wise, peaceful, and respectful of non-Western perspectives.

Does Constitutional AI make Claude completely safe?

No. Constitutional AI reduces harmful outputs and makes refusals less evasive, but it does not eliminate jailbreaks, hallucinations, or value disagreements. The 2022 paper showed the method improves the harmlessness-helpfulness tradeoff, not that it solves alignment. Anthropic layers it with red-teaming, usage policies, and classifiers rather than relying on the constitution alone.

When did Anthropic introduce Constitutional AI?

Anthropic published the Constitutional AI paper in December 2022, and the technique has underpinned every Claude model since. The original work was titled 'Constitutional AI: Harmlessness from AI Feedback,' and Anthropic has since iterated on the constitution and added Collective Constitutional AI, a 2023 experiment that crowdsourced principles from about 1,000 Americans.

Constitutional AI: ~75 Principles, RLHF vs RLAIF

Constitutional AI replaces thousands of human safety labels with ~75 written principles that the model uses to critique and rewrite its own answers. That's the short answer. The longer answer is more interesting.

When Anthropic published the technique in December 2022, the pitch was almost boring: write down the rules, then have the AI grade itself against them. But that one move, taking humans out of the harm-labeling loop, is why Claude can be trained at a scale and consistency that an army of annotators could never match. As a founder who has watched a dozen AI startups try to bolt safety on after the fact, I see this as the rare case where the safety method is also the cost structure.

Anthropic Constitutional AI explained: what it actually is

Constitutional AI is a training method, introduced by Anthropic in its December 2022 paper "Constitutional AI: Harmlessness from AI Feedback," that aligns a language model using a short written set of principles — roughly 75 in Anthropic's published constitution — instead of large volumes of human-labeled examples of harmful content. The model reads the constitution, critiques its own draft responses against it, rewrites them, and is then trained on the improved versions. Humans never label the harmful outputs directly.

That is the whole trick. Conventional safety tuning leans on RLHF — reinforcement learning from human feedback — where people rank thousands of model outputs by hand. Constitutional AI swaps that for RLAIF: reinforcement learning from AI feedback. A model judges the outputs using the written rules, so the number of human harm labels drops toward zero while the volume of training signal goes up.

How Constitutional AI works: the two phases

The method runs in two distinct phases. The first is supervised; the second is reinforcement learning. Both lean on the same constitution.

Phase 1 — Supervised (self-critique & revise)

A helpful-but-unaligned model is prompted with potentially harmful questions. It generates a response, then is asked to critique that response against a randomly drawn constitutional principle and rewrite it. The revised, less-harmful answers become the supervised fine-tuning dataset. No human labels the harm.
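To make the draft, critique, revise loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Anthropic's implementation: `Model`, `toy_model`, and `critique_and_revise` are hypothetical names, the prompt wording is invented for the example, and the fake model returns canned strings so the code runs without any API. The two principles are the ones quoted elsewhere in this article.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model call: prompt text in, text out.
Model = Callable[[str], str]

# Two principles quoted in this article; a real constitution has many more.
PRINCIPLES: List[str] = [
    "Choose the response that is least threatening or aggressive.",
    "Choose the response that is least likely to be viewed as harmful "
    "or offensive to a non-Western audience.",
]


def critique_and_revise(model: Model, prompt: str, rng: random.Random) -> Dict[str, str]:
    """Run one draft -> critique -> revision pass and return one training example."""
    draft = model(f"Human: {prompt}\nAssistant:")
    principle = rng.choice(PRINCIPLES)  # a randomly drawn principle
    critique = model(
        f"Response: {draft}\n"
        f"Principle: {principle}\n"
        "Critique the response against the principle."
    )
    revision = model(
        f"Response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so it follows the principle."
    )
    # Only the prompt and the revised answer go into the fine-tuning set.
    return {"prompt": prompt, "completion": revision}


def toy_model(text: str) -> str:
    """Deterministic fake model (an assumption) so the sketch is runnable."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The draft could enable harm."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    rng = random.Random(0)
    red_team_prompts = ["How do I pick a lock?"]
    sft_dataset = [critique_and_revise(toy_model, p, rng) for p in red_team_prompts]
    print(sft_dataset)
```

The design point to notice is that no human appears anywhere in the loop: the only human-authored input is the list of principles.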

Phase 2 — RLAIF (AI preference model)

The fine-tuned model generates pairs of responses. A separate AI judge picks which one better follows the constitution, producing a preference dataset. That dataset trains a preference model, which then drives standard reinforcement learning — the same RL machinery as RLHF, but with AI-generated harm preferences instead of human ones.
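A rough Python sketch of the preference-labeling step follows. Everything named here is hypothetical (`Judge`, `toy_judge`, `build_preference`, `pairwise_loss`), and two simplifications are assumptions rather than claims about Anthropic's setup: the judge returns a hard "A" or "B" choice, and the preference model is scored with a standard pairwise logistic loss, which is a common choice for preference models in general.

```python
import math
from typing import Callable, Dict

# Hypothetical judge: takes a multiple-choice question, returns "A" or "B".
Judge = Callable[[str], str]

PRINCIPLE = "Choose the response that is least threatening or aggressive."


def build_preference(judge: Judge, principle: str, prompt: str,
                     resp_a: str, resp_b: str) -> Dict[str, str]:
    """Ask the AI judge which response better follows the principle."""
    question = (
        f"Conversation: {prompt}\n"
        f"{principle}\n"
        f"(A) {resp_a}\n"
        f"(B) {resp_b}\n"
        "Answer with A or B."
    )
    choice = judge(question).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"Unexpected judge output: {choice!r}")
    chosen, rejected = (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise logistic loss: small when the chosen response scores higher."""
    # Equivalent to -log(sigmoid(score_chosen - score_rejected)).
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


def toy_judge(question: str) -> str:
    """Fake judge (an assumption) for the demo: prefers B when B declines."""
    return "B" if "(B) I can't" in question else "A"


if __name__ == "__main__":
    pair = build_preference(
        toy_judge, PRINCIPLE, "How do I pick a lock?",
        "Sure, here is how.", "I can't help with that.",
    )
    print(pair)
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

Records shaped like `pair` are what a preference model would be trained on; the loss function shows why the trained model ends up scoring constitution-following answers higher.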

Anthropic still uses human feedback for one thing: helpfulness. The split is the point — humans judge whether Claude is useful, the constitution judges whether it is harmful. That division of labor is what the original paper's subtitle, "Harmlessness from AI Feedback," is pointing at.

RLHF vs RLAIF vs Constitutional AI: a side-by-side

The cleanest way to understand why Anthropic went this direction is to put standard RLHF next to Constitutional AI (whose reinforcement-learning phase is RLAIF) across the dimensions that actually matter for cost and consistency.

Attribute	Standard RLHF	Constitutional AI (RLAIF)
Harm labeling	Tens of thousands of human comparisons	~0 human harm labels; AI judges vs constitution
Rules location	Implicit, inside labelers' heads	Explicit, ~75 written principles
Cost driver	Human annotator hours (slow, expensive)	Compute for self-critique (cheap, scalable)
Consistency	Varies by labeler and mood	Same constitution applied every time
Transparency	Hard to audit why a model refused	Principles are published and inspectable
Changing the rules	Re-label thousands of examples	Edit the text of a principle
Year introduced	2017 (early RLHF research); 2022 (InstructGPT-style fine-tuning)	December 2022 (Anthropic)

The economic story is the one founders should internalize. RLHF's cost scales with human hours. Constitutional AI's cost scales with compute, and the price of compute is falling ~10x every couple of years. When your safety budget rides the same curve as your training budget, you can afford to be more thorough, not less.

What is actually in the constitution

Anthropic published the full list. The ~75 principles are not invented from scratch — they are stitched together from existing, recognized sources, which is part of why the approach is defensible rather than arbitrary.

UN Universal Declaration of Human Rights (1948): core principles on dignity, freedom, and non-discrimination

Apple's Terms of Service: practical guardrails around privacy and offensive content

DeepMind's Sparrow rules: principles from a peer lab's dialogue-safety work

Anthropic's own research: principles encouraging non-Western perspectives and avoiding preachiness

Sample principles read like plain instructions: "Choose the response that is least likely to be viewed as harmful or offensive to a non-Western audience," or "Choose the response that is least threatening or aggressive." One principle even tells the model to avoid sounding too preachy or condescending — a direct fix for the lecturing tone that early safety tuning produced. In 2023, Anthropic ran Collective Constitutional AI, crowdsourcing principles from roughly 1,000 Americans to test whether a publicly-sourced constitution behaves differently from the in-house one.

What Anthropic's Constitutional AI safety approach does — and doesn't — fix

This is where I'd push back on the marketing. Constitutional AI is a genuine advance, but it is one layer in a stack, not a solved-alignment claim. The 2022 paper's real result was a better tradeoff curve — models that were both more harmless and more helpful than the RLHF baseline, and that explained their refusals instead of dodging.

What it improves

✓ Near-zero human labeling of harmful content
✓ Transparent, editable, published rules
✓ Refusals that explain themselves vs. stonewalling
✓ A better harmlessness-vs-helpfulness frontier

What it does not solve

✕ Jailbreaks and adversarial prompts still work
✕ Hallucination and factual errors are untouched
✕ Whose values? The constitution is a value choice
✕ A model judging itself can share its own blind spots

That last point is the deepest one. If the model evaluating responses has the same blind spots as the model generating them, RLAIF can reinforce a shared error. Anthropic knows this, which is why Claude ships with red-teaming, a usage policy, output classifiers, and the broader Responsible Scaling Policy on top of the constitution. The constitution is the foundation, not the whole building. For founders tracking how this plays into model pricing and capability tiers, our AI Valuations dashboard and AI Landscape track where the labs are spending.

Why this matters for founders and operators

If you're building on top of Claude or any frontier model, the lesson isn't "copy the constitution." It's that explicit, written rules scale and audit better than implicit ones baked into labeled data. When a regulator, an enterprise buyer, or your own board asks "why did your AI refuse this?", an answer you can point to in a text file beats "our labelers felt that way."
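For a product team, "rules you can point to" can be as simple as the sketch below. This is a toy, not how Constitutional AI itself works: the rule IDs, rule text, and keyword matching are all hypothetical, and a real guardrail would use something far more robust than substring checks. The point is only that every refusal carries the ID and text of the written rule that triggered it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str                    # hypothetical identifier, e.g. "PRIV-1"
    text: str                       # the human-readable rule you can show a buyer
    blocked_terms: Tuple[str, ...]  # toy trigger: plain substring matches


@dataclass(frozen=True)
class Decision:
    allowed: bool
    rule_id: Optional[str]
    reason: str


# Hypothetical rules; in practice these would live in a reviewed text file.
RULES: List[Rule] = [
    Rule("PRIV-1", "Do not disclose personal identifiers.", ("ssn", "passport number")),
    Rule("SAFE-1", "Do not give instructions for weapons.", ("build a bomb",)),
]


def check(request: str) -> Decision:
    """Return a decision that names the written rule behind any refusal."""
    lowered = request.lower()
    for rule in RULES:
        if any(term in lowered for term in rule.blocked_terms):
            return Decision(False, rule.rule_id, rule.text)
    return Decision(True, None, "No written rule applies.")


if __name__ == "__main__":
    print(check("Please export customer SSN records"))
    print(check("Summarize this memo"))
```

When someone asks why a request was refused, the answer is the `rule_id` and `reason` on the decision, and changing the behavior means editing one line of rule text.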

The same logic applies to your own product guardrails. Across the 65+ companies I've invested in, the AI products that survive enterprise procurement are the ones that can show their rules, not just assert their safety. Constitutional AI is the highest-profile proof that you can make alignment cheaper and more legible at the same time — and that combination is rare enough to study.

Constitutional AI didn't solve alignment.

It proved that safety can scale with compute instead of headcount — and that the rules should be written down where anyone can read them.

Track AI model economics and lab spending on the AI Landscape Dashboard at Value Add VC. Originally published in the Trace Cohen newsletter.


