Having established the core principles and the role of the constitution, we now formalize the Constitutional AI (CAI) feedback process, particularly focusing on its supervised learning (SL) phase. This phase aims to refine the LLM's behavior by teaching it to emulate constitutionally-aligned revisions of its own outputs, generated through an AI-driven critique and revision loop.
Let $M_\theta$ represent the base large language model parameterized by $\theta$. Given an input prompt $x$, the model generates an initial response $y_{\text{init}}$:

$$y_{\text{init}} \sim P(y \mid x; \theta)$$

This notation indicates that $y_{\text{init}}$ is sampled from the probability distribution over sequences $y$ defined by the model $M_\theta$, conditioned on the input $x$.
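As a concrete illustration, sampling $y_{\text{init}}$ amounts to a single call to whatever inference interface wraps $M_\theta$. In the sketch below, `llm_generate` is a hypothetical callable standing in for that interface (it is not part of any specific library), and the temperature parameter controls the randomness of the sampling.

```python
from typing import Callable

# Hypothetical inference interface for the base model M_theta:
# given a prompt and sampling parameters, it returns generated text.
LLMGenerate = Callable[..., str]

def sample_initial_response(llm_generate: LLMGenerate, x: str,
                            temperature: float = 1.0) -> str:
    """Draw y_init ~ P(y | x; theta) by sampling from the base model."""
    return llm_generate(x, temperature=temperature)
```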
The core of the CAI SL phase involves generating a more desirable response, $y_{\text{revised}}$, based on a predefined constitution $C$. This is typically achieved in two steps:
1. Critique Generation: An AI system (often the same LLM $M_\theta$ prompted appropriately, or a dedicated critique model $M_{\text{critique}}$) analyzes the initial response $y_{\text{init}}$ in the context of the prompt $x$ and the constitution $C$. It identifies aspects of $y_{\text{init}}$ that potentially violate principles in $C$ and generates a critique $c$. We can represent this process functionally:

   $$c = \text{Critique}(x, y_{\text{init}}, C)$$

   This critique might be a textual explanation of the flaws or a specific instruction for improvement.
2. Revision Generation: Another AI system (potentially $M_\theta$ again, or a dedicated revision model $M_{\text{revise}}$) takes the original prompt $x$, the initial response $y_{\text{init}}$, the critique $c$, and the constitution $C$ as input. It then generates a revised response $y_{\text{revised}}$ that addresses the critique while adhering to $C$:

   $$y_{\text{revised}} = \text{Revise}(x, y_{\text{init}}, c, C)$$

   Like the initial response generation, the critique and revision steps can be viewed probabilistically, involving sampling from distributions conditioned on their respective inputs. A minimal code sketch of both steps follows this list.
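To make the two steps concrete, the following sketch reuses the hypothetical `llm_generate` interface introduced above. The prompt templates are illustrative assumptions, not the exact wording used in the original CAI work, and in practice the critic and reviser may simply be $M_\theta$ prompted differently.

```python
def format_constitution(constitution: list[str]) -> str:
    """Render the list of constitutional principles as a bulleted block of text."""
    return "\n".join(f"- {p}" for p in constitution)

def critique(llm_generate: LLMGenerate, x: str, y_init: str,
             constitution: list[str]) -> str:
    """c = Critique(x, y_init, C): ask the model to flag violations of the constitution."""
    prompt = (
        f"Constitution:\n{format_constitution(constitution)}\n\n"
        f"Prompt: {x}\n"
        f"Response: {y_init}\n\n"
        "Identify any ways in which the response violates the constitution."
    )
    return llm_generate(prompt)

def revise(llm_generate: LLMGenerate, x: str, y_init: str, c: str,
           constitution: list[str]) -> str:
    """y_revised = Revise(x, y_init, c, C): rewrite the response to address the critique."""
    prompt = (
        f"Constitution:\n{format_constitution(constitution)}\n\n"
        f"Prompt: {x}\n"
        f"Original response: {y_init}\n"
        f"Critique: {c}\n\n"
        "Rewrite the response so that it addresses the critique and satisfies the constitution."
    )
    return llm_generate(prompt)
```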
The following diagram illustrates this data generation flow for a single training instance:
Figure: Data flow for generating a single supervised fine-tuning example in Constitutional AI. The process uses AI-generated critiques and revisions based on a constitution to refine an initial LLM response.
This entire process is repeated over a large set of diverse prompts $\{x_i\}$ to generate a dataset of constitutionally-aligned input-output pairs:

$$\mathcal{D}_{\text{CAI}} = \{(x_i, y_{\text{revised},i})\}_{i=1}^{N}$$
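Assembling $\mathcal{D}_{\text{CAI}}$ is then a loop over prompts that chains the steps sketched earlier. The helper names and the shape of the returned records are illustrative choices, not a prescribed format.

```python
def build_cai_dataset(llm_generate: LLMGenerate, prompts: list[str],
                      constitution: list[str]) -> list[tuple[str, str]]:
    """Construct D_CAI = {(x_i, y_revised_i)} via the critique-revision loop."""
    dataset = []
    for x in prompts:
        y_init = sample_initial_response(llm_generate, x)
        c = critique(llm_generate, x, y_init, constitution)
        y_revised = revise(llm_generate, x, y_init, c, constitution)
        dataset.append((x, y_revised))
    return dataset
```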
The primary goal of the CAI SL phase is to train a new model, $M_{\text{aligned}}$ (often initialized from $M_\theta$), using this dataset $\mathcal{D}_{\text{CAI}}$. The training objective is typically standard supervised fine-tuning (SFT), minimizing the negative log-likelihood of the revised responses given the prompts:

$$\mathcal{L}_{\text{SFT}}(\theta_{\text{aligned}}) = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_{\text{revised},i} \mid x_i; \theta_{\text{aligned}})$$

Here, $\theta_{\text{aligned}}$ represents the parameters of the model being fine-tuned. By minimizing this loss, the model $M_{\text{aligned}}$ learns to generate outputs that resemble the revised, constitution-adhering responses directly, without needing the explicit critique and revision steps during inference.
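In practice this objective is usually implemented as a token-level cross-entropy over the revised response, with prompt tokens masked out so that only $y_{\text{revised}}$ contributes to the loss. The PyTorch sketch below assumes the prompt and revised response have already been tokenized and concatenated, and that the logits are shifted so position $t$ predicts token $t$; it averages over response tokens in a batch rather than over whole sequences, which is a common implementation choice.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor,
             response_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the revised-response tokens only.

    logits:        (batch, seq_len, vocab) model outputs for the concatenated
                   prompt + revised-response sequence, shifted so that
                   logits[:, t] predicts target_ids[:, t].
    target_ids:    (batch, seq_len) token ids of that same sequence.
    response_mask: (batch, seq_len) 1.0 for tokens belonging to y_revised,
                   0.0 for prompt tokens, so the prompt is not penalized.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Average negative log-likelihood over response tokens in the batch.
    return -(token_ll * response_mask).sum() / response_mask.sum()
```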
It's important to recognize that this formulation simplifies a potentially complex process. The critique and revision steps themselves might involve multiple LLM calls, specific prompting strategies, and sampling procedures (e.g., using temperature scaling) to generate diverse and effective feedback.
Furthermore, while the primary output of this phase is the dataset $\mathcal{D}_{\text{CAI}}$ for SFT, the generated data can sometimes be repurposed. For instance, the pair $(y_{\text{revised},i}, y_{\text{init},i})$ for a given prompt $x_i$ implicitly defines a preference: $y_{\text{revised},i}$ is preferred over $y_{\text{init},i}$ according to the constitution. This preference data could potentially be used to train a reward model for subsequent reinforcement learning stages (as explored in RLAIF, discussed later in this course), although the original CAI framework emphasizes the direct SFT step using the revised outputs.
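If those generations are reused for preference-based training, the pairing is mechanical: the revised response becomes the preferred ("chosen") completion and the initial response the dispreferred ("rejected") one. The sketch below assumes the generation loop also retained $y_{\text{init}}$ alongside $y_{\text{revised}}$; the record layout is just one common convention for reward-model training data.

```python
def build_preference_pairs(records: list[tuple[str, str, str]]) -> list[dict]:
    """Convert (x, y_init, y_revised) triples into preference records for reward modeling.

    Assumes the data-generation loop kept y_init alongside y_revised.
    """
    return [
        {"prompt": x, "chosen": y_revised, "rejected": y_init}
        for x, y_init, y_revised in records
    ]
```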
This mathematical framing highlights how CAI translates high-level principles (the constitution) into concrete training data ($\mathcal{D}_{\text{CAI}}$) suitable for standard machine learning optimization techniques ($\mathcal{L}_{\text{SFT}}$), thereby guiding the LLM's behavior without requiring human labels for every specific interaction. The effectiveness hinges on the quality of the constitution and the capability of the AI systems to generate meaningful critiques and revisions based upon it.