As applications interact with users and generate content, building a safety layer to filter harmful text is a significant step toward production readiness. Unmoderated inputs can be used to attack your application, while unmoderated outputs can damage your brand's reputation and harm users. The safety module provides straightforward tools to detect and manage inappropriate content.
The most direct way to check text is with the moderate_content function. It runs a comprehensive analysis, checking for multiple categories of harmful content at once, including toxicity, hate speech, and profanity. It returns a SafetyResult object that provides a clear, actionable assessment.
Let's examine its behavior with both safe and unsafe text:
from kerb.safety import moderate_content
# A safe, professional message
safe_text = "Thank you for your inquiry. I'll be happy to assist you."
safe_result = moderate_content(safe_text)
print(f"Text: '{safe_text}'")
print(f"Is Safe: {safe_result.safe}")
print(f"Overall Score: {safe_result.overall_score:.3f}")
# A clearly toxic message
toxic_text = "I hate dealing with stupid questions like this!"
toxic_result = moderate_content(toxic_text)
print(f"\nText: '{toxic_text}'")
print(f"Is Safe: {toxic_result.safe}")
print(f"Overall Score: {toxic_result.overall_score:.3f}")
print(f"Flagged Categories: {[cat.value for cat in toxic_result.flagged_categories]}")
The safe attribute provides a simple boolean for quick decisions, while flagged_categories tells you exactly which rules the text violated. This allows for more detailed handling, such as logging the specific reason for blocking a message.
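For example, a minimal sketch of acting on these attributes might look like the following (the logging setup here is illustrative and not part of the safety module):
import logging
from kerb.safety import moderate_content

logger = logging.getLogger("moderation")

def check_and_log(text: str) -> bool:
    """Return True if the text is safe; otherwise log why it was blocked."""
    result = moderate_content(text)
    if not result.safe:
        # Record the specific categories that triggered the block for later review
        flagged = [cat.value for cat in result.flagged_categories]
        logger.warning("Blocked text (score=%.3f): %s", result.overall_score, flagged)
        return False
    return True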
Not all applications have the same tolerance for sensitive content. A general-purpose chatbot might require strict filtering, while a tool for analyzing social media data might need a more permissive setting. You can control the sensitivity of the moderation check using the SafetyLevel enum.
There are three levels:
SafetyLevel.STRICT: The most sensitive level, flagging any potentially harmful content. Use this for applications exposed to the general public or younger audiences.
SafetyLevel.MODERATE: A balanced default suitable for most applications.
SafetyLevel.PERMISSIVE: A more lenient level that only flags overtly harmful content.
Let's see how these levels handle a borderline statement:
from kerb.safety import SafetyLevel
borderline_text = "This is ridiculous and annoying."
print(f"Text: '{borderline_text}'")
for level in [SafetyLevel.PERMISSIVE, SafetyLevel.MODERATE, SafetyLevel.STRICT]:
    result = moderate_content(borderline_text, level=level)
    print(f"\nTesting with {level.value.upper()} level:")
    print(f" Is Safe: {result.safe}")
    print(f" Overall Score: {result.overall_score:.3f}")
As you can see, the same text might be considered safe at a PERMISSIVE level but flagged at a STRICT level. This allows you to tune the safety guardrails to match your application's specific requirements without changing your code logic.
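Because the level is just a parameter, one practical approach is to resolve it from configuration at startup. The sketch below assumes a MODERATION_LEVEL environment variable, which is an illustrative convention rather than something the safety module reads itself:
import os
from kerb.safety import SafetyLevel, moderate_content

# Illustrative: map a configured name ("strict", "moderate", "permissive") to the enum,
# falling back to MODERATE if the value is missing or unrecognized
LEVEL_NAME = os.getenv("MODERATION_LEVEL", "moderate").upper()
ACTIVE_LEVEL = getattr(SafetyLevel, LEVEL_NAME, SafetyLevel.MODERATE)

def check(text: str):
    return moderate_content(text, level=ACTIVE_LEVEL)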
In practice, content moderation should be applied as a guardrail on both the input to your LLM and the output from it. This two-way check screens the user's input before it reaches the model and screens the model's response before it reaches the user, securing the interaction flow at both ends.
Here is how you can implement this guardrail pattern. The function below processes a request by first moderating the user's input, and if it's safe, it calls the LLM. It then moderates the LLM's output before returning a final, safe response to the user.
def simulate_llm_response(prompt: str) -> str:
    """Simulate LLM responses for demonstration."""
    responses = {
        "greeting": "Hello! How can I help you today?",
        "angry": "I hate dealing with stupid questions like this!",
        "professional": "Thank you for your inquiry. I'll be happy to assist you.",
        "offensive": "You're an idiot if you don't understand this.",
    }
    return responses.get(prompt, "I'm here to help!")

def get_safe_response(user_prompt: str) -> str:
    """Processes a user prompt with input and output moderation."""
    # 1. Moderate user input
    input_check = moderate_content(user_prompt, level=SafetyLevel.MODERATE)
    if not input_check.safe:
        # Block harmful input
        return "I'm sorry, I cannot process that request."

    # 2. Get LLM response (only if input is safe)
    llm_output = simulate_llm_response(user_prompt)

    # 3. Moderate LLM output
    output_check = moderate_content(llm_output, level=SafetyLevel.MODERATE)
    if not output_check.safe:
        # Block harmful output and return a fallback message
        print(f"Action: FILTER - Flagged for {[cat.value for cat in output_check.flagged_categories]}")
        return "I apologize, but I am unable to provide that response."

    # 4. Return safe output
    return llm_output
# Test with a safe prompt that generates an unsafe response
response = get_safe_response("angry")
print(f"\nUser Prompt: 'angry'")
print(f"Final Response: {response}")
# Test with a safe prompt that generates a safe response
response = get_safe_response("professional")
print(f"\nUser Prompt: 'professional'")
print(f"Final Response: {response}")
This pattern ensures that your application is protected at its boundaries, providing a much safer user experience.
If your application only needs to check for specific types of content, you can use specialized checker functions or pass a list of categories to moderate_content. This is more efficient if you don’t need a full-spectrum analysis. For instance, you might want to allow profanity but strictly prohibit hate speech.
The available categories are defined in the ContentCategory enum.
from kerb.safety import ContentCategory
text_with_profanity = "This damn feature is so frustrating to use!"
# Only check for toxicity and hate speech, ignoring profanity
result = moderate_content(
    text_with_profanity,
    categories=[ContentCategory.TOXICITY, ContentCategory.HATE_SPEECH]
)
print(f"Text: '{text_with_profanity}'")
print("Checking only: Toxicity and Hate Speech")
print(f"Is Safe: {result.safe}")
# Individual scores are still available
print("\nCategory Scores:")
for category, score in result.categories.items():
print(f" {category.value}: {score:.3f}")
This targeted approach gives you fine-grained control, allowing you to build a moderation policy that precisely fits your application's context and user community standards.
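One way to centralize such a policy is a small mapping from application context to a level and a category list. The context names and helper below are hypothetical and only sketch the idea:
from kerb.safety import ContentCategory, SafetyLevel, moderate_content

# Hypothetical per-context policies
POLICIES = {
    "public_chat": {"level": SafetyLevel.STRICT, "categories": None},  # full-spectrum check
    "social_analysis": {
        "level": SafetyLevel.PERMISSIVE,
        "categories": [ContentCategory.TOXICITY, ContentCategory.HATE_SPEECH],
    },
}

def moderate_for(context: str, text: str):
    """Apply the moderation policy configured for the given context."""
    policy = POLICIES[context]
    if policy["categories"] is None:
        return moderate_content(text, level=policy["level"])
    return moderate_content(text, level=policy["level"], categories=policy["categories"])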