As applications interact with users and generate content, building a safety layer to filter harmful text is a significant step toward production readiness. Unmoderated inputs can be used to attack your application, while unmoderated outputs can damage your brand's reputation and harm users. The safety module provides straightforward tools to detect and manage inappropriate content.
The most direct way to check text is with the moderate_content function. It runs a comprehensive analysis, checking for multiple categories of harmful content at once, including toxicity, hate speech, and profanity. It returns a SafetyResult object that provides a clear, actionable assessment.
Let's examine its behavior with both safe and unsafe text:
from kerb.safety import moderate_content
# A safe, professional message
safe_text = "Thank you for your inquiry. I'll be happy to assist you."
safe_result = moderate_content(safe_text)
print(f"Text: '{safe_text}'")
print(f"Is Safe: {safe_result.safe}")
print(f"Overall Score: {safe_result.overall_score:.3f}")
# A clearly toxic message
toxic_text = "I hate dealing with stupid questions like this!"
toxic_result = moderate_content(toxic_text)
print(f"\nText: '{toxic_text}'")
print(f"Is Safe: {toxic_result.safe}")
print(f"Overall Score: {toxic_result.overall_score:.3f}")
print(f"Flagged Categories: {[cat.value for cat in toxic_result.flagged_categories]}")
The safe attribute provides a simple boolean for quick decisions, while flagged_categories tells you exactly which rules the text violated. This allows for more detailed handling, such as logging the specific reason for blocking a message.
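For example, a minimal sketch of acting on these attributes might look like the following (the logging setup here is illustrative and not part of the safety module):
import logging
from kerb.safety import moderate_content

logger = logging.getLogger("moderation")

def check_and_log(text: str) -> bool:
    """Return True if the text is safe; otherwise log why it was blocked."""
    result = moderate_content(text)
    if not result.safe:
        # Record the specific categories that triggered the block for later review
        flagged = [cat.value for cat in result.flagged_categories]
        logger.warning("Blocked text (score=%.3f): %s", result.overall_score, flagged)
        return False
    return True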
Not all applications have the same tolerance for sensitive content. A general-purpose chatbot might require strict filtering, while a tool for analyzing social media data might need a more permissive setting. You can control the sensitivity of the moderation check using the SafetyLevel enum.
There are three levels:
SafetyLevel.STRICT: The most sensitive level, flagging any potentially harmful content. Use this for applications exposed to the general public or younger audiences.
SafetyLevel.MODERATE: A balanced default suitable for most applications.
SafetyLevel.PERMISSIVE: A more lenient level that only flags overtly harmful content.
Let's see how these levels handle a borderline statement:
from kerb.safety import SafetyLevel
borderline_text = "This is ridiculous and annoying."
print(f"Text: '{borderline_text}'")
for level in [SafetyLevel.PERMISSIVE, SafetyLevel.MODERATE, SafetyLevel.STRICT]:
    result = moderate_content(borderline_text, level=level)
    print(f"\nTesting with {level.value.upper()} level:")
    print(f" Is Safe: {result.safe}")
    print(f" Overall Score: {result.overall_score:.3f}")
As you can see, the same text might be considered safe at a PERMISSIVE level but flagged at a STRICT level. This allows you to tune the safety guardrails to match your application's specific requirements without changing your code logic.
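Because the level is just a parameter, one practical approach is to resolve it from configuration at startup. The sketch below assumes a MODERATION_LEVEL environment variable, which is an illustrative convention rather than something the safety module reads itself:
import os
from kerb.safety import SafetyLevel, moderate_content

# Illustrative: map a configured name ("strict", "moderate", "permissive") to the enum,
# falling back to MODERATE if the value is missing or unrecognized
LEVEL_NAME = os.getenv("MODERATION_LEVEL", "moderate").upper()
ACTIVE_LEVEL = getattr(SafetyLevel, LEVEL_NAME, SafetyLevel.MODERATE)

def check(text: str):
    return moderate_content(text, level=ACTIVE_LEVEL)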
In practice, content moderation should be applied as a guardrail on both the input to your LLM and the output from it. This two-way check screens the user's input before it reaches the model and screens the model's response before it reaches the user, securing the interaction flow at both ends.
Here is how you can implement this guardrail pattern. The function below processes a request by first moderating the user's input, and if it's safe, it calls the LLM. It then moderates the LLM's output before returning a final, safe response to the user.
def simulate_llm_response(prompt: str) -> str:
    """Simulate LLM responses for demonstration."""
    responses = {
        "greeting": "Hello! How can I help you today?",
        "angry": "I hate dealing with stupid questions like this!",
        "professional": "Thank you for your inquiry. I'll be happy to assist you.",
        "offensive": "You're an idiot if you don't understand this.",
    }
    return responses.get(prompt, "I'm here to help!")

def get_safe_response(user_prompt: str) -> str:
    """Processes a user prompt with input and output moderation."""
    # 1. Moderate user input
    input_check = moderate_content(user_prompt, level=SafetyLevel.MODERATE)
    if not input_check.safe:
        # Block harmful input
        return "I'm sorry, I cannot process that request."

    # 2. Get LLM response (only if input is safe)
    llm_output = simulate_llm_response(user_prompt)

    # 3. Moderate LLM output
    output_check = moderate_content(llm_output, level=SafetyLevel.MODERATE)
    if not output_check.safe:
        # Block harmful output and return a fallback message
        print(f"Action: FILTER - Flagged for {[cat.value for cat in output_check.flagged_categories]}")
        return "I apologize, but I am unable to provide that response."

    # 4. Return safe output
    return llm_output
# Test with a safe prompt that generates an unsafe response
response = get_safe_response("angry")
print(f"\nUser Prompt: 'angry'")
print(f"Final Response: {response}")
# Test with a safe prompt that generates a safe response
response = get_safe_response("professional")
print(f"\nUser Prompt: 'professional'")
print(f"Final Response: {response}")
This pattern ensures that your application is protected at its boundaries, providing a much safer user experience.
If your application only needs to check for specific types of content, you can use specialized checker functions or pass a list of categories to moderate_content. This is more efficient if you don’t need a full-spectrum analysis. For instance, you might want to allow profanity but strictly prohibit hate speech.
The available categories are defined in the ContentCategory enum.
from kerb.safety import ContentCategory
text_with_profanity = "This damn feature is so frustrating to use!"
# Only check for toxicity and hate speech, ignoring profanity
result = moderate_content(
    text_with_profanity,
    categories=[ContentCategory.TOXICITY, ContentCategory.HATE_SPEECH]
)
print(f"Text: '{text_with_profanity}'")
print("Checking only: Toxicity and Hate Speech")
print(f"Is Safe: {result.safe}")
# Individual scores are still available
print("\nCategory Scores:")
for category, score in result.categories.items():
print(f" {category.value}: {score:.3f}")
This targeted approach gives you fine-grained control, allowing you to build a moderation policy that precisely fits your application's context and user community standards.
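One way to centralize such a policy is a small mapping from application context to a level and a category list. The context names and helper below are hypothetical and only sketch the idea:
from kerb.safety import ContentCategory, SafetyLevel, moderate_content

# Hypothetical per-context policies
POLICIES = {
    "public_chat": {"level": SafetyLevel.STRICT, "categories": None},  # full-spectrum check
    "social_analysis": {
        "level": SafetyLevel.PERMISSIVE,
        "categories": [ContentCategory.TOXICITY, ContentCategory.HATE_SPEECH],
    },
}

def moderate_for(context: str, text: str):
    """Apply the moderation policy configured for the given context."""
    policy = POLICIES[context]
    if policy["categories"] is None:
        return moderate_content(text, level=policy["level"])
    return moderate_content(text, level=policy["level"], categories=policy["categories"])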