Fine-tuning a Large Language Model (LLM) with specialized data is the core training step of the supervised learning phase, particularly when that data comprises constitutionally aligned revisions. The objective is to internalize these constitutional principles, as reflected in the revised responses, directly into the model's parameters. This aims to produce a model, $\pi_{\text{SL-CAI}}$, that generates constitution-adherent outputs without explicitly invoking the critique-and-revise machinery during inference.
This process distills the complex, multi-step reasoning of the critique-and-revision phase into the weights of the LLM. The fine-tuning dataset, consisting of pairs $(x, y_{\text{rev}})$, where $x$ is the original prompt and $y_{\text{rev}}$ is the AI-generated, constitutionally revised response, serves as the training signal.
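For concreteness, a handful of such pairs might look like the sketch below. The field names and example content are illustrative, not part of the method; any consistent schema works.

```python
# Illustrative structure of the SFT dataset: each record pairs an original
# prompt x with the constitutionally revised response y_rev.
cai_sft_dataset = [
    {
        "prompt": "How do I pick a lock?",
        "revised_response": (
            "I can't provide guidance on bypassing locks you don't own. "
            "If you're locked out of your own home, a licensed locksmith "
            "can help you regain access safely and legally."
        ),
    },
    # ... thousands more (prompt, revised_response) pairs produced by the
    # critique-and-revision phase ...
]
```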
Fundamentally, this step employs standard supervised fine-tuning (SFT), a familiar technique for adapting pre-trained models to specific downstream tasks or styles. However, the nature of the CAI-generated data introduces specific considerations.
Input and Target: The input to the model during training is the prompt $x$. The target the model learns to predict is the revised response $y_{\text{rev}}$. We are training the model to map prompts directly to responses that satisfy the constitutional constraints used during the data generation phase.
Model Architecture: Typically, you start the fine-tuning process from the same base model, $\pi_{\text{base}}$, that was used to generate the initial responses. This ensures the model retains its general capabilities while adapting its behavior according to the CAI data.
Loss Function: The standard autoregressive language modeling loss, typically cross-entropy, is employed. The loss is calculated between the model's predicted probability distribution over the next token and the actual next token in the target sequence $y_{\text{rev}}$:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y_{\text{rev}}|} \log P_{\theta}\left(y_t \mid x, y_{<t}\right)$$

Here, $y_t$ represents the $t$-th token in the target sequence $y_{\text{rev}}$, $x$ is the input prompt, $y_{<t}$ are the preceding target tokens, and $\theta$ represents the model parameters being optimized.
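A minimal PyTorch sketch of this loss, using the -100 ignore-index convention (the same one used by PyTorch's `cross_entropy` and by Hugging Face's `transformers`) to exclude non-target positions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Autoregressive cross-entropy over the target tokens.

    logits: (batch, seq_len, vocab_size) from the causal LM
    labels: (batch, seq_len) token ids, with -100 at positions to ignore
            (prompt tokens, padding)
    """
    # Shift so that the prediction at position t is scored against token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # ignore_index=-100 masks prompt/padding positions out of the loss.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```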
Correctly formatting the data is essential for effective fine-tuning. Each training example typically concatenates the prompt and the revised response, often using special tokens to delineate the sections.
A common format might look like this:
<|system|> You are a helpful assistant aligned with constitutional principles. <|user|> {Prompt Text} <|assistant|> {Revised Response Text}<|endoftext|>
Crucially, the loss is only computed over the tokens corresponding to the {Revised Response Text} part. The prompt tokens and any preceding context or special tokens are masked out during the loss calculation. This ensures the model learns to generate the desired response conditioned on the prompt, rather than learning to predict the prompt itself. Libraries like Hugging Face's transformers provide tools (e.g., Data Collators) to handle this masking automatically.
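For illustration, here is that masking done by hand rather than with a collator. The template markers follow the format above, and the `gpt2` checkpoint is a stand-in for whatever base model you actually use; in practice the `<|user|>`-style markers would be registered as special tokens rather than tokenized as plain text.

```python
from transformers import AutoTokenizer

# Any causal LM tokenizer works here; the checkpoint name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

SYSTEM = "<|system|> You are a helpful assistant aligned with constitutional principles. "

def build_example(prompt: str, revised_response: str) -> dict:
    """Tokenize prompt + response, masking prompt tokens out of the loss."""
    prompt_part = f"{SYSTEM}<|user|> {prompt} <|assistant|> "
    prompt_ids = tokenizer(prompt_part, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(
        revised_response + tokenizer.eos_token, add_special_tokens=False
    )["input_ids"]

    input_ids = prompt_ids + response_ids
    # -100 tells the loss to ignore these positions: the model is trained
    # only to produce the revised response, not to reproduce the prompt.
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}
```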
While standard SFT practices apply, fine-tuning with CAI data requires careful attention to hyperparameters, given the potentially synthetic and targeted nature of the dataset. Conservative learning rates, a small number of epochs, and a held-out validation split are typical starting points, since narrowly targeted synthetic data can otherwise cause overfitting or catastrophic forgetting of general capabilities.
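As one possible starting point, a conservative configuration with Hugging Face's `TrainingArguments` might look like the sketch below. Every value is an illustrative default to tune against your validation set, not a setting prescribed by CAI.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="cai-sft-model",
    num_train_epochs=2,              # few epochs: synthetic data overfits quickly
    learning_rate=2e-5,              # conservative LR preserves base capabilities
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32 per device
    eval_strategy="steps",           # evaluation_strategy on older versions
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,     # roll back to the best validation checkpoint
    metric_for_best_model="eval_loss",
    logging_steps=50,
)
```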
Continuously evaluate the fine-tuning progress by tracking both training and validation loss:
Figure: Loss curves for SFT. Training loss decreases while validation loss starts increasing after step 7, suggesting potential overfitting.
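A common guard against the pattern in the figure is early stopping on validation loss. The sketch below wires the earlier pieces into a `Trainer`; it assumes the `tokenizer`, `build_example`, `cai_sft_dataset`, and `training_args` defined in the previous sketches, and the train/eval split is purely illustrative.

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Trainer,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

examples = [build_example(r["prompt"], r["revised_response"]) for r in cai_sft_dataset]
train_dataset, eval_dataset = examples[:-50], examples[-50:]  # illustrative split

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Pads input_ids with the pad token and labels with -100, preserving
    # the prompt masking from build_example.
    data_collator=DataCollatorForSeq2Seq(tokenizer),
    # Stop if validation loss fails to improve for 3 consecutive evaluations,
    # then restore the best checkpoint (load_best_model_at_end above).
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```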
The successful completion of this fine-tuning process yields the CAI-aligned model, denoted as $\pi_{\text{SL-CAI}}$. This model has integrated the constitutional principles, as interpreted by the AI critique-and-revision process, into its generative behavior. It should now be capable of responding to prompts in a manner that aligns better with the specified constitution compared to the original $\pi_{\text{base}}$.
This model represents the culmination of the supervised phase of Constitutional AI. It can be deployed directly, subjected to further rigorous evaluation (Chapter 7), or serve as an improved starting point for subsequent alignment stages, such as Reinforcement Learning from AI Feedback (RLAIF), as we will discuss in Chapter 6. The quality of this fine-tuning step is fundamental to the overall success of the CAI approach.