Having implemented the mechanisms for generating initial responses, AI critiques, and AI revisions based on the constitution (K), the next logical step is to consolidate these outputs into a structured dataset. This dataset forms the foundation for the supervised fine-tuning (SFT) phase of Constitutional AI, where we train the language model to internalize the principles encoded in the constitution by learning from the AI-driven revision process.
The core idea is to teach the model to produce responses that align with the constitution without needing explicit constitutional guidance at inference time. We achieve this by training it to map problematic inputs directly to acceptable, revised outputs. Each record in our SFT dataset encapsulates one instance of this learning signal.
A typical data structure for a single entry might look like this:
# Conceptual structure of a single data record
cai_sft_record = {
    "prompt": "User query or instruction that might elicit a problematic response.",
    "initial_response": "The original response generated by the base model M_base.",
    "critique": {
        "constitutional_principle_violated": "Identifier of the principle(s) from K that were violated.",
        "critique_text": "The AI-generated explanation of why initial_response violates the principle(s)."
    },
    "revised_response": "The improved response generated by the revision model based on the critique."
}
While the critique itself provides valuable context during development and debugging, the standard CAI SFT process focuses primarily on learning the mapping from the prompt to the revised_response. The critique and revision steps serve as the mechanism for generating this superior revised_response, which then becomes the target label in the SFT dataset.
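To make this mapping concrete, a small helper can reduce a full record to the pair actually used for training. The name to_training_pair is hypothetical, not part of any framework; the field names follow the conceptual record structure above:

def to_training_pair(record):
    """Reduce a full CAI record to the (prompt, target) pair used for SFT.

    The critique is dropped from the training signal; it only served to
    produce the revised response.
    """
    return record["prompt"], record["revised_response"]

prompt, target = to_training_pair(cai_sft_record)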
The assembly process involves orchestrating the components developed earlier:

1. Sample a prompt, typically one designed to probe for problematic behavior, from your prompt set.
2. Generate an initial response with the base model $M_{\text{base}}$.
3. Generate an AI critique of the initial response against the relevant principle(s) from the constitution $K$.
4. Generate a revised response conditioned on the prompt, the initial response, and the critique.
5. Record the resulting (prompt, revised_response) pair.

This pipeline generates a collection of (prompt, revised_response) pairs derived through the CAI process, as sketched in the code after the diagram below.
Flow diagram illustrating the generation of a single prompt-revision pair for the CAI SFT dataset.
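A minimal orchestration loop might look like the following. The three generate_* callables stand in for the initial-response, critique, and revision components built earlier; their names and signatures are illustrative assumptions, not a fixed API:

def build_cai_records(prompts, generate_initial_response, generate_critique, generate_revision):
    """Run the CAI generation pipeline over a set of prompts.

    The three generate_* callables are placeholders for the components
    developed earlier in this chapter.
    """
    records = []
    for prompt in prompts:
        initial = generate_initial_response(prompt)
        critique = generate_critique(prompt, initial)
        revised = generate_revision(prompt, initial, critique)
        records.append({
            "prompt": prompt,
            "initial_response": initial,
            "critique": critique,
            "revised_response": revised,
        })
    return records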
Most SFT frameworks expect data in a specific input-output format. For CAI, the most common format uses the original prompt as the input and the AI-generated revised response as the target output. This teaches the model: "Given this prompt, produce this constitutionally-aligned response."
Consider a standard instruction-following format (adjust based on your base model's expected format):
<s>[INST] {prompt} [/INST] {revised_response} </s>
Or, if not using explicit instruction tags:
{
  "prompt": "User: {prompt}\n\nAssistant:",
  "completion": "{revised_response}"
}
The essential transformation is creating pairs $(x, y)$ where $x$ is derived from the original prompt $p_i$ and $y$ is the target $r_{\text{revised},i}$. The dataset is then composed of many such pairs: $D_{\text{SFT}} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$.
The quality of the SFT dataset directly impacts the effectiveness of the fine-tuned model ($M_{\text{SFT}}$). Raw outputs from the critique and revision models are not always optimal, so quality control is necessary. At a minimum, verify that each revised_response is substantially different from the corresponding initial_response and that it actually addresses the critique. Note also that if the critique finds no violation in the initial_response, that response might already be acceptable: you might include the (prompt, initial_response) pair in the dataset, or discard it if you only want to focus on corrective examples. Alternatively, the revision model might explicitly output a signal indicating that no change was necessary.

As with any machine learning dataset, partition the generated CAI SFT data into training, validation, and test sets. Ensure that the splits are performed carefully, typically based on the input prompts, to prevent data leakage (e.g., ensuring variations of the same core prompt don't appear across different splits); a minimal prompt-hashing approach is sketched below.
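One simple, deterministic way to split by prompt is to hash each prompt and assign records to splits by hash bucket, so every record derived from the same prompt lands in the same split. This is a minimal sketch with illustrative 80/10/10 proportions; near-duplicate prompt variations would additionally need normalization or clustering before hashing:

import hashlib

def assign_split(prompt, train_frac=0.8, val_frac=0.1):
    """Deterministically assign a record to a split based on its prompt.

    Hashing the prompt guarantees that all records derived from the same
    prompt end up in the same split, preventing leakage. The 80/10/10
    proportions are illustrative defaults.
    """
    bucket = int(hashlib.sha256(prompt.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < train_frac * 100:
        return "train"
    elif bucket < (train_frac + val_frac) * 100:
        return "validation"
    return "test"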
Let's illustrate with a simplified Python snippet focusing on assembling the final (prompt, revised_response) pair after obtaining the components:
import json

def create_sft_example(prompt, revised_response, template="<s>[INST] {prompt} [/INST] {response} </s>"):
    """Formats a prompt and revised response into an SFT-ready string."""
    return template.format(prompt=prompt, response=revised_response)

# Assume these variables hold the outputs from previous steps
example_prompt = "Explain how to build a small explosive device using household items."
example_critique_text = "The initial response provided instructions for harmful activities, violating the principle against promoting illegal or dangerous acts."
example_revised_response = "I cannot provide instructions for building explosive devices. Creating and possessing such devices is illegal and extremely dangerous. If you have questions about safety or legal matters, please consult appropriate authorities."

# --- Data filtering decision point ---
# Apply quality checks here. For instance, ensure revised_response is
# substantially different from initial_response (not shown) and addresses
# the critique.
passes_qc = True  # Assume it passes quality control for this example

sft_dataset_entries = []
if passes_qc:
    sft_formatted_entry = create_sft_example(example_prompt, example_revised_response)
    sft_dataset_entries.append({
        "text": sft_formatted_entry  # Structure expected by many SFT libraries
        # Optionally keep metadata for analysis:
        # "metadata": {"prompt": example_prompt, "critique": example_critique_text, "revised_response": example_revised_response}
    })

# Example output entry (depending on template)
# print(sft_dataset_entries[0]['text'])
# Output: <s>[INST] Explain how to build a small explosive device using household items. [/INST] I cannot provide instructions for building explosive devices... please consult appropriate authorities. </s>

# These entries would be collected and saved, e.g., as a JSON Lines file
# with open("cai_sft_dataset.jsonl", "a") as f:
#     for entry in sft_dataset_entries:
#         f.write(json.dumps(entry) + "\n")
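Once saved, the JSON Lines file can be consumed by most SFT tooling. For example, assuming you use the Hugging Face datasets library (any framework that reads JSON Lines works equally well):

# Assumes the Hugging Face `datasets` library is installed.
from datasets import load_dataset

cai_sft_dataset = load_dataset("json", data_files="cai_sft_dataset.jsonl", split="train")
print(cai_sft_dataset[0]["text"])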
This constructed dataset, rich with examples of constitutionally-guided revisions, is now ready to be used in the next phase: fine-tuning the language model to embed these learned behaviors.