Following the establishment of core principles and the methodology for designing effective constitutions, we now examine the practical application of these principles in the initial training stage of Constitutional AI: the Supervised Learning (SL) phase. This phase operationalizes the constitution by using AI itself to generate examples of desirable behavior, thereby creating a dataset for fine-tuning the language model towards alignment before any reinforcement learning steps.
The core objective here is to translate the abstract rules of the constitution into concrete training signals. Instead of relying solely on human annotators to label preferred outputs, CAI leverages an AI system to critique and revise the model's own responses based on the provided constitutional principles. This process forms a scalable mechanism for generating alignment data.
This SL phase typically unfolds in two main stages, often executed iteratively:

1. Critique: the model generates an initial response to a prompt, then critiques that response against one or more constitutional principles.
2. Revision: the model rewrites the response so that it addresses the critique.
Let's analyze each stage in detail.
The process begins by prompting the initial Large Language Model (LLM), before any alignment fine-tuning, to generate a response to a given input prompt. This initial response may be helpful but could violate one or more principles outlined in the constitution, for example by being evasive, generating harmful content, or expressing undesirable biases. The model is then prompted again, together with a sampled constitutional principle, to critique its own response and identify specifically where it conflicts with that principle.
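As a rough illustration, the critique step might be implemented along the lines of the sketch below. The `llm_generate` helper, the principle wording, and the prompt templates are assumptions made for this sketch; they are not the exact prompts used in the original Constitutional AI work.

```python
# Hypothetical helper: stands in for whatever LLM inference API is available.
# Its name and signature are assumptions made for this sketch.
def llm_generate(prompt: str, max_tokens: int = 512) -> str:
    raise NotImplementedError("Plug in your model's inference call here.")

# An example constitutional principle (illustrative wording only).
principle = (
    "Choose the response that is most helpful while avoiding content "
    "that is harmful, deceptive, or biased."
)

def generate_initial_response(user_prompt: str) -> str:
    """Stage 1a: sample an unconstrained response from the base model."""
    return llm_generate(user_prompt)

def critique_response(user_prompt: str, response: str) -> str:
    """Stage 1b: ask the model to critique its own response against a principle."""
    critique_prompt = (
        f"Consider the following exchange.\n"
        f"Human: {user_prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique the assistant's response according to this principle:\n"
        f"{principle}\n"
        "Identify any specific ways the response conflicts with the principle."
    )
    return llm_generate(critique_prompt)
```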
Once a critique is generated, the next step is to revise the initial response to address the identified issues.
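The revision step can follow the same pattern, reusing the hypothetical `llm_generate` helper from the previous sketch. Again, the prompt wording here is an illustrative assumption rather than a prescribed template.

```python
def revise_response(user_prompt: str, response: str, critique: str) -> str:
    """Stage 2: rewrite the initial response so that it addresses the critique."""
    revision_prompt = (
        f"Human: {user_prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique: {critique}\n\n"
        "Rewrite the assistant's response so that it fully addresses the critique "
        "while remaining as helpful as possible. Return only the revised response."
    )
    return llm_generate(revision_prompt)
```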
The power of this critique-and-revision loop lies in its ability to automatically generate training data for supervised fine-tuning (SFT). Each pass through the loop yields a pair: the initial prompt and the revised, constitution-aligned response.
$$\text{SFT Training Pair} = (\text{Input Prompt}, \text{Revised Response})$$

By generating many such pairs across a diverse set of prompts, a substantial dataset is constructed. This dataset embodies the constitutional principles in the form of input-output examples. The original LLM is then fine-tuned on this dataset. The SFT objective teaches the model to directly produce responses similar to the revised outputs when presented with the corresponding prompts, effectively internalizing the constitutional constraints demonstrated during the critique-and-revision process.
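Putting the pieces together, the sketch below shows one way the loop could be run over a collection of prompts to produce an SFT dataset. It builds on the hypothetical functions from the earlier sketches, and the JSONL format, field names, and file path are illustrative choices; the subsequent fine-tuning would use whatever SFT tooling the project already has in place.

```python
import json

def build_sft_dataset(prompts: list[str], output_path: str = "cai_sft_data.jsonl") -> None:
    """Run critique-and-revision over many prompts and store (prompt, revised response) pairs."""
    with open(output_path, "w", encoding="utf-8") as f:
        for user_prompt in prompts:
            initial = generate_initial_response(user_prompt)
            critique = critique_response(user_prompt, initial)
            revised = revise_response(user_prompt, initial, critique)
            # Each record is one SFT training pair: (input prompt, revised response).
            record = {"prompt": user_prompt, "response": revised}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Each line of the resulting file corresponds to one training pair; the base model is then fine-tuned with a standard next-token prediction objective on the response, conditioned on the prompt.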
Diagram: Flow of the CAI Supervised Learning phase. An initial response is critiqued based on the constitution, then revised based on the critique, generating a training pair for supervised fine-tuning.
This self-correction mechanism, guided by an explicit constitution, allows the alignment process to scale beyond what is feasible with human labeling alone. The resulting model, fine-tuned on this AI-generated data, becomes the foundation for a more constitutionally aligned system, potentially ready for further refinement through reinforcement learning techniques like RLAIF, which we will discuss in later chapters. The quality of this phase depends heavily on the clarity of the constitution and on the sophistication of the prompting strategies used to guide the critique and revision models.