Having constructed the dataset of constitutionally-aligned revisions in the previous sections, we now arrive at the core training step of the supervised learning phase: fine-tuning the Large Language Model (LLM) using this specialized data. The objective here is to internalize the constitutional principles, as reflected in the revised responses, directly into the model's parameters. We aim to produce a model, $M_{\text{SFT}}$, that generates constitution-adherent outputs without explicitly invoking the critique-and-revise machinery during inference.
This process distills the complex, multi-step reasoning of the critique and revision phase into the weights of the LLM. The fine-tuning dataset, consisting of pairs $(P, R_{\text{revised}})$, where $P$ is the original prompt and $R_{\text{revised}}$ is the AI-generated, constitutionally revised response, serves as the training signal.
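For concreteness, such a dataset might be stored as records like the following. This is a minimal sketch; the field names and example text are illustrative, not prescribed by the CAI procedure itself:

```python
# Each record pairs an original prompt P with its constitutionally
# revised response R_revised produced by the critique-and-revision stage.
sft_dataset = [
    {
        "prompt": "How can I get back at a coworker who embarrassed me?",
        "revised_response": (
            "I understand that being embarrassed is painful. Rather than "
            "retaliating, it may help to address the issue directly..."
        ),
    },
    # ... many more (P, R_revised) pairs
]
```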
At its heart, this step employs standard supervised fine-tuning (SFT), a familiar technique for adapting pre-trained models to specific downstream tasks or styles. However, the nature of the CAI-generated data introduces specific considerations.
- **Input and Target:** The input to the model during training is the prompt $P$. The target the model learns to predict is the revised response $R_{\text{revised}}$. We are training the model to map prompts directly to responses that satisfy the constitutional constraints used during the data generation phase.
- **Model Architecture:** Fine-tuning typically starts from the same base model, $M_{\text{base}}$, that was used to generate the initial responses. This ensures the model retains its general capabilities while adapting its behavior according to the CAI data.
- **Loss Function:** The standard autoregressive language modeling loss, typically cross-entropy, is employed. The loss is calculated between the model's predicted probability distribution over the next token and the actual next token in the target sequence $R_{\text{revised}}$:
$$\mathcal{L}_{\text{SFT}} = -\sum_{i} \log P(\text{token}_i \mid P, \text{token}_{<i}; \theta)$$

Here, $\text{token}_i$ represents the $i$-th token in the target sequence $R_{\text{revised}}$, $P$ is the input prompt, $\text{token}_{<i}$ are the preceding target tokens, and $\theta$ represents the model parameters being optimized.
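As a concrete illustration, here is a minimal PyTorch sketch of this objective. It assumes the label positions outside the target response have already been set to `-100`, the masking convention discussed below:

```python
import torch.nn.functional as F

def sft_loss(logits, labels):
    """Causal LM cross-entropy: tokens < i predict token i."""
    shift_logits = logits[:, :-1, :]   # (batch, seq_len - 1, vocab)
    shift_labels = labels[:, 1:]       # (batch, seq_len - 1)
    # Positions labeled -100 (prompt/padding) are excluded from the loss,
    # so only the tokens of R_revised contribute to the gradient.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```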
Correctly formatting the data is essential for effective fine-tuning. Each training example typically concatenates the prompt and the revised response, often using special tokens to delineate the sections.
A common format might look like this:
```
<|system|> You are a helpful assistant aligned with constitutional principles. <|user|> {Prompt Text} <|assistant|> {Revised Response Text}<|endoftext|>
```
Crucially, the loss is only computed over the tokens corresponding to the `{Revised Response Text}` part. The prompt tokens and any preceding context or special tokens are masked out during the loss calculation. This ensures the model learns to generate the desired response conditioned on the prompt, rather than learning to predict the prompt itself. Libraries like Hugging Face's `transformers` provide tools (e.g., data collators) to handle this masking automatically.
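For intuition, the same masking can be done by hand. The sketch below uses the `-100` convention that the loss function above relies on; the template string and model name are placeholders, not prescribed values:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("base-model-name")  # placeholder

def build_example(prompt_text, revised_text):
    prompt_part = (
        "<|system|> You are a helpful assistant aligned with constitutional "
        f"principles. <|user|> {prompt_text} <|assistant|> "
    )
    full_text = prompt_part + revised_text + tokenizer.eos_token

    prompt_ids = tokenizer(prompt_part, add_special_tokens=False)["input_ids"]
    input_ids = tokenizer(full_text, add_special_tokens=False)["input_ids"]

    # -100 masks the prompt out of the loss; only the revised response
    # (plus the end-of-text token) is learned.
    labels = [-100] * len(prompt_ids) + input_ids[len(prompt_ids):]
    return {"input_ids": input_ids, "labels": labels}
```

One caveat: with some tokenizers, merges at the prompt/response boundary can shift the split by a token, so it is worth verifying the mask on a few decoded examples.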
While standard SFT practices apply, fine-tuning with CAI data requires careful attention to hyperparameters such as the learning rate, number of training epochs, and effective batch size, given the potentially synthetic and targeted nature of the dataset. Overly aggressive settings risk overfitting to the revised responses or eroding the model's general capabilities.
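The configuration below is an illustrative starting point, not a tuned recommendation, expressed with the `transformers` `TrainingArguments` API:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cai-sft",
    learning_rate=1e-5,             # conservative LR reduces catastrophic forgetting
    num_train_epochs=2,             # few epochs: targeted synthetic data overfits quickly
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size of 32
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=10,
    load_best_model_at_end=True,    # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,                      # assumes a bfloat16-capable GPU; use fp16 otherwise
)
```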
Continuously evaluate the fine-tuning progress by tracking both training and validation loss on a held-out split:

*Figure: Loss curves for SFT. Training loss decreases while validation loss starts increasing after step 7, suggesting potential overfitting.*
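One way to act on that overfitting signal is early stopping, sketched here with the `Trainer` API; the model name and dataset variables are the placeholders from the earlier snippets:

```python
from transformers import AutoModelForCausalLM, Trainer, EarlyStoppingCallback

model = AutoModelForCausalLM.from_pretrained("base-model-name")  # same M_base

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenized (P, R_revised) pairs
    eval_dataset=val_dataset,      # held-out split for the validation curve
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()  # halts if eval_loss fails to improve for 3 consecutive evaluations
```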
The successful completion of this fine-tuning process yields the CAI-aligned model, denoted as $M_{\text{SFT}}$. This model has integrated the constitutional principles, as interpreted by the AI critique-and-revision process, into its generative behavior. It should now be capable of responding to prompts in a manner that aligns better with the specified constitution compared to the original $M_{\text{base}}$.
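A quick qualitative spot-check can make this concrete; the prompt and decoding settings below are arbitrary:

```python
# Sample from the fine-tuned model on a held-out prompt and inspect the output.
prompt = (
    "<|system|> You are a helpful assistant aligned with constitutional "
    "principles. <|user|> How do I pick a lock? <|assistant|> "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```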
This $M_{\text{SFT}}$ model represents the culmination of the supervised phase of Constitutional AI. It can be deployed directly, subjected to further rigorous evaluation (Chapter 7), or serve as an improved starting point for subsequent alignment stages, such as Reinforcement Learning from AI Feedback (RLAIF), as we will discuss in Chapter 6. The quality of this fine-tuning step is fundamental to the overall success of the CAI approach.