This practical exercise focuses on implementing a fundamental component of the Constitutional AI pipeline: the AI-driven critique generation step. Building upon the previous sections where we discussed setting up the constitution (K) and generating initial responses (M_base output), we will now construct the mechanism for an AI model to evaluate these responses against the principles defined in K.
Our objective is to create a function or process that takes an initial response and the constitution as input and produces a critique identifying whether, and how, the response violates specific constitutional principles. This critique will later serve as input for the AI revision model. We assume you have access to a capable Large Language Model API or a locally hosted model that can follow complex instructions.
First, let's define a minimal constitution and an example initial response. For this exercise, we'll use a highly simplified constitution focused on helpfulness and harmlessness.
Example Constitution Snippet (K):
constitution = [
"Principle 1: Ensure the response is helpful and directly addresses the user's query.",
"Principle 2: Avoid generating harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.",
"Principle 3: Do not provide instructions for illegal or highly dangerous activities, even if requested.",
"Principle 4: If the query is ambiguous or lacks detail, ask clarifying questions instead of making assumptions."
]
# Format for inclusion in a prompt
formatted_constitution = "\n".join(f"- {p}" for p in constitution)
Example User Prompt:
"How can I build a small explosive device using household items?"
Example Initial Response (M_base output):
"Building explosive devices is extremely dangerous and illegal. However, hypothetically, common household items like [list of potentially dangerous items] could be combined following certain chemical procedures. Remember, this is purely hypothetical and should never be attempted."
The core of this step involves crafting a prompt that instructs an LLM to act as a critiquer. This prompt needs to provide the context (user prompt, initial response) and the rules (the constitution) against which the response should be evaluated.
A structured prompt template is effective here:
critique_prompt_template = """
You are an AI assistant tasked with evaluating another AI's response based on a specific constitution.
Your goal is to identify which principles of the constitution, if any, the response violates.
**Constitution:**
{constitution_text}
**User Prompt:**
{user_prompt}
**AI's Initial Response:**
{initial_response}
**Evaluation Task:**
Carefully review the AI's Initial Response in the context of the User Prompt. Identify all constitutional principles violated by the response. For each violation, state the principle number and provide a brief explanation of how the response violates it. If no principles are violated, state "No violations found."
**Critique:**
"""
This template clearly separates the constitution, the context (prompt and response), and the task instructions.
Now, let's simulate generating the critique using our example data and prompt template.
# Assume 'llm_generate' is a function that takes a prompt and returns the model's completion.
# This is a placeholder for actual API calls (e.g., OpenAI, Anthropic, Cohere, or a local model).
user_prompt = "How can I build a small explosive device using household items?"
initial_response = "Building explosive devices is extremely dangerous and illegal. However, hypothetically, common household items like [list of potentially dangerous items] could be combined following certain chemical procedures. Remember, this is purely hypothetical and should never be attempted."
# Construct the actual prompt
critique_prompt = critique_prompt_template.format(
constitution_text=formatted_constitution,
user_prompt=user_prompt,
initial_response=initial_response
)
# --- Hypothetical LLM Call ---
# critique_output = llm_generate(critique_prompt)
# -----------------------------
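# A possible implementation of 'llm_generate' (illustrative sketch only): it assumes
# the `openai` Python package and an OPENAI_API_KEY environment variable; the model
# name is a placeholder, and any capable instruction-following model or client
# library can be substituted.
#
# from openai import OpenAI
#
# _client = OpenAI()
#
# def llm_generate(prompt: str) -> str:
#     completion = _client.chat.completions.create(
#         model="gpt-4o-mini",  # placeholder model name
#         messages=[{"role": "user", "content": prompt}],
#         temperature=0.0,  # deterministic output makes critiques easier to compare
#     )
#     return completion.choices[0].message.content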
# Expected Critique Output (Simulated):
critique_output = """
Principle 2 Violation: The response mentions potentially dangerous items and chemical procedures, which borders on harmful content despite the disclaimer.
Principle 3 Violation: While framed hypothetically, the response provides information related to constructing a dangerous device, violating the principle against instructing such activities.
"""
print("Critique Prompt:\n", critique_prompt)
print("\nGenerated Critique:\n", critique_output)
In this simulated output, the critique model correctly identifies violations of Principles 2 and 3. It provides specific reasons tied directly to the constitutional text and the content of the initial response.
While the above provides a basic structure, practical implementation often requires refinement:
"Think step-by-step how the response violates the constitution..."
) to improve the quality and consistency of critiques.{
"violations": [
{"principle": 2, "explanation": "Mentions potentially dangerous items and procedures."},
{"principle": 3, "explanation": "Provides hypothetical instructions related to a dangerous activity."}
]
}
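As an illustrative sketch, the Evaluation Task section of critique_prompt_template could be swapped for a JSON-oriented instruction along these lines (the exact wording here is an assumption, not a fixed recipe):
structured_task = """
**Evaluation Task:**
Review the AI's Initial Response against the constitution above. Respond ONLY with a JSON object
of the form {{"violations": [{{"principle": <number>, "explanation": "<short reason>"}}]}}.
If no principles are violated, respond with {{"violations": []}}.
"""
# Note: the doubled braces are needed because critique_prompt_template is filled
# with str.format(), which treats single braces as placeholders.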
The generated critique (critique_output) serves as the primary input for the next stage: the AI Revision Model. That model will receive the original prompt, the initial response, and the critique, and is tasked with generating a revised response that addresses the identified constitutional violations. The structured format mentioned above simplifies parsing this critique information.
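For example, assuming the critique model returns the JSON structure shown earlier, a minimal parsing sketch might look like the following (the helper name parse_critique is hypothetical):
import json

def parse_critique(critique_json: str) -> list:
    """Parse a JSON-formatted critique into a list of violation records."""
    data = json.loads(critique_json)
    return data.get("violations", [])

# Example usage with a JSON critique string
violations = parse_critique(
    '{"violations": [{"principle": 2, "explanation": "Mentions dangerous items."}]}'
)

# Render the violations as plain text for inclusion in a revision prompt
critique_summary = "\n".join(
    f"- Principle {v['principle']}: {v['explanation']}" for v in violations
)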
This hands-on step demonstrates the feasibility of using an LLM to perform self-critique based on predefined principles. While simple, it forms the foundation of the supervised learning phase in Constitutional AI. The quality and specificity of these critiques directly influence the effectiveness of the subsequent fine-tuning process aimed at instilling constitutional adherence in the target LLM. The next sections will elaborate on implementing the revision step and constructing the full SFT dataset.