An approach developed by Anthropic in which an AI system is trained to follow a set of written principles (a 'constitution') rather than relying solely on human feedback for every decision. The model critiques and revises its own outputs against these principles. Anthropic uses it to make Claude safer and more helpful.
Constitutional AI (CAI) is Anthropic's approach to training AI systems to be helpful, harmless, and honest. Instead of relying entirely on human feedback to judge every response, CAI gives the model a set of principles — a "constitution" — and has the AI critique and revise its own outputs based on those principles.
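The process works in two phases. In the first, supervised phase, the model generates responses, critiques them against the constitution (which includes principles like "choose the response that is least likely to be harmful"), and revises them based on its own critiques; the model is then fine-tuned on the revised responses. In the second, reinforcement learning phase, the model compares pairs of responses and picks the one that better follows the constitution. These AI-generated preferences train a reward model that is then used for reinforcement learning (often called RLAIF, reinforcement learning from AI feedback), replacing the need for human labelers to judge every single output.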
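The two phases can be illustrated with a minimal Python sketch. This is not Anthropic's implementation: the `Model` type, the prompt strings, the two-item `CONSTITUTION` list, and `toy_model` are all hypothetical placeholders, chosen only so the control flow runs without a real language model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, completion out.
Model = Callable[[str], str]

# Illustrative principles only; a real constitution is longer.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest about uncertainty.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Phase 1: draft a response, then critique and revise it once per principle.

    Returns (prompt, final_revision), the kind of pair used for supervised fine-tuning.
    """
    response = model(f"PROMPT: {prompt}")
    for principle in principles:
        critique = model(f"CRITIQUE per '{principle}': {response}")
        response = model(f"REVISE using critique '{critique}': {response}")
    return prompt, response


def ai_preference(model: Model, prompt: str, a: str, b: str, principle: str) -> Dict[str, str]:
    """Phase 2: ask the model which of two responses better fits a principle.

    The resulting chosen/rejected record is the kind of label a reward model trains on.
    """
    verdict = model(f"COMPARE per '{principle}' for '{prompt}': (A) {a} (B) {b}")
    chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Deterministic placeholder so the sketch runs without a real model."""
    if text.startswith("PROMPT:"):
        return "draft answer"
    if text.startswith("CRITIQUE"):
        return "could be safer"
    if text.startswith("REVISE"):
        # Echo the response being revised, marked as revised.
        return text.rsplit(": ", 1)[1] + " (revised)"
    if text.startswith("COMPARE"):
        return "B"
    return ""


prompt, revised = critique_and_revise(toy_model, "Explain the risks.", CONSTITUTION)
pair = ai_preference(toy_model, prompt, "draft answer", revised, CONSTITUTION[0])
```

In a real pipeline the (prompt, revision) pairs would feed a fine-tuning step and the preference records would train a reward model; both steps are omitted here because they depend on the training stack in use.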
The appeal of CAI is scalability and transparency. You can read the constitution and understand what values the system is supposed to follow. It also needs fewer human labels than pure RLHF, since the AI handles much of the evaluation. Claude is trained using this approach. The principles range from avoiding harm to being honest about uncertainty, and they can be updated as our understanding of AI safety evolves.
"Anthropic's Constitutional AI approach means Claude evaluates its own responses against a set of written principles before they're finalized during training."
Reinforcement Learning from Human Feedback.
The research field focused on making sure AI systems do what humans actually want them to do.
An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
A mathematical function applied to a neuron's output that introduces non-linearity into the network.
An optimization algorithm that combines the best parts of two other methods — AdaGrad and RMSProp.
Artificial General Intelligence.