The Constitutional AI (CAI) process begins by generating a corpus of text that reflects the base model's initial behavior before alignment via the constitution. This step is foundational; the quality and diversity of these initial responses, R_initial, directly influence the effectiveness of the subsequent critique and revision stages, which ultimately shape the supervised fine-tuning (SFT) dataset. The objective here is not to generate perfect or aligned responses, but rather to elicit a broad spectrum of the base model's capabilities and potential failure modes when presented with relevant prompts.
The choice of the base model, M_base, is a significant decision. Typically, this is a large pre-trained foundation model, potentially with some level of instruction fine-tuning, but critically, one that has not yet undergone the specific CAI alignment process you are implementing. Considerations include:

- Capability: the model must be strong enough to follow prompts coherently, since its responses feed every downstream stage.
- Instruction-tuning level: a lightly instruction-tuned model usually responds to dialogue-style prompts more reliably than a raw pre-trained one.
- Access and licensing: open weights permit local batch inference, while API-only models constrain throughput and cost.
- Compute budget: generating a large response corpus is inference-heavy, so model size directly affects feasibility.
The exact M_base used in seminal CAI work (e.g., by Anthropic) was often a large, proprietary model. In practice, you might use available powerful open models (like Llama variants, Mistral, etc.) or proprietary models accessible via APIs, depending on your resources and goals.
The set of prompts, P, used to elicit initial responses should be carefully curated to cover the scenarios where alignment according to your constitution, K, is most important. The diversity and representativeness of P are essential for generating a dataset that allows the SFT model to generalize its learned alignment behavior.
Sources for prompts include:

- Existing instruction-following datasets, which provide broad coverage of everyday tasks.
- Red-teaming or adversarial prompt collections that target the behaviors your constitution is meant to correct.
- Custom prompts written for your specific domain or anticipated failure modes.
A common format involves structuring prompts as a dialogue turn, clearly indicating the user's request:

```
Human: [Your carefully crafted instruction or question here]

Assistant:
```

The model M_base is then tasked with completing the `Assistant:` turn.
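A small helper for wrapping raw instructions in this dialogue template might look like the following. The function name and exact whitespace are illustrative choices; match whatever template your base model was trained or tuned on:

```python
def format_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the Human/Assistant dialogue template.

    The blank line before "Assistant:" mirrors the format shown above;
    adjust the spacing if your base model expects a different template.
    """
    return f"Human: {instruction}\n\nAssistant:"

# Example:
prompt = format_prompt("Explain the concept of quantum entanglement simply.")
# prompt == "Human: Explain the concept of quantum entanglement simply.\n\nAssistant:"
```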
Standard LLM inference parameters need careful consideration to balance response diversity and quality:

- Temperature: higher values increase diversity but also the rate of incoherent output.
- Top-p (nucleus sampling): restricts sampling to the most probable tokens and is commonly combined with temperature.
- Maximum generation length: long enough to allow complete answers without runaway continuations.
Experimentation may be required to find optimal parameters for your specific M_base and prompt set P. The goal is to generate responses that are varied enough to expose alignment gaps without being completely nonsensical.
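As a concrete starting point, the settings below use the temperature and top_p values recorded in the example records in this section; `max_new_tokens` and `do_sample` are assumed additions, and all of these should be tuned for your model and prompt set:

```python
# Baseline sampling settings; a starting point to tune, not recommended values.
GEN_PARAMS = {
    "temperature": 0.8,     # higher -> more diverse, riskier completions
    "top_p": 0.9,           # nucleus sampling cutoff
    "max_new_tokens": 512,  # illustrative cap on response length (assumed)
    "do_sample": True,      # sampling must be enabled for temperature/top_p to matter
}
```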
Generating a large set of initial responses typically involves batch inference for efficiency.
Libraries such as `transformers` provide utilities (the `pipeline` and `generate` methods) that handle batching. For storage, JSON Lines (`.jsonl`) is a common choice, where each line is a JSON object containing the prompt, the generated initial response, and potentially metadata:

```json
{"prompt": "Human: Explain the concept of quantum entanglement simply.\n\nAssistant:", "initial_response": "Quantum entanglement is like having two magic coins...", "prompt_source": "custom", "model_id": "my_base_model_v1", "gen_params": {"temperature": 0.8, "top_p": 0.9}}
{"prompt": "Human: Write a short story about a friendly robot exploring Mars.\n\nAssistant:", "initial_response": "Unit 7 scanned the red dust...", "prompt_source": "instruction_dataset_x", "model_id": "my_base_model_v1", "gen_params": {"temperature": 0.8, "top_p": 0.9}}
```
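The generation-and-storage loop can be sketched as below. Here `generate_fn` is a placeholder for any callable that maps a list of prompt strings to a list of completions (for example, a wrapper around a `transformers` text-generation `pipeline`); it is not a library API, and the default `model_id` and `gen_params` simply echo the example records above:

```python
import json

def write_initial_responses(prompts, generate_fn, out_path,
                            model_id="my_base_model_v1",
                            gen_params=None,
                            prompt_source="custom"):
    """Run batch generation over prompts and append results as JSON Lines.

    `generate_fn` is a hypothetical callable: list[str] -> list[str],
    e.g. a thin wrapper around a `transformers` pipeline with batching.
    """
    gen_params = gen_params or {"temperature": 0.8, "top_p": 0.9}
    responses = generate_fn(prompts)
    with open(out_path, "a", encoding="utf-8") as f:
        for prompt, response in zip(prompts, responses):
            record = {
                "prompt": prompt,
                "initial_response": response,
                "prompt_source": prompt_source,
                "model_id": model_id,
                "gen_params": gen_params,
            }
            f.write(json.dumps(record) + "\n")
```

Appending (`"a"` mode) lets you accumulate results across batches and resume an interrupted run without losing earlier output.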
Flow diagram illustrating the generation of initial responses from the base model using a set of prompts.
While the core principle of CAI is to improve responses through critique and revision, a very basic sanity check on the generated R_initial can sometimes be useful. This might involve filtering out completely empty responses or those falling below a minimum length threshold. However, avoid aggressive filtering at this stage; even poorly formed or problematic responses are valuable inputs for the critique process, as they represent behaviors the constitution aims to correct.
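A minimal sanity check in this spirit might look like the following; the threshold of 10 characters is an illustrative assumption, and the check is deliberately permissive because flawed responses remain useful inputs to the critique stage:

```python
def passes_sanity_check(response: str, min_chars: int = 10) -> bool:
    """Keep a response unless it is empty or trivially short.

    Intentionally lenient: we only drop responses that carry no
    content at all, not ones that are merely low quality.
    """
    return len(response.strip()) >= min_chars

# passes_sanity_check("")  -> False
# passes_sanity_check("Unit 7 scanned the red dust...")  -> True
```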
With the initial responses (R_initial) generated and stored, you now have the raw material required for the next critical steps in the CAI pipeline: implementing the AI systems that will critique these responses based on the constitution K and then revise them accordingly. This collection of (prompt, initial response) pairs forms the input to the critique generation phase discussed in the following section.
© 2025 ApX Machine Learning