When an LLM agent decides to use one of your custom Python tools, it will attempt to provide the necessary inputs based on the tool's description and its understanding of the task. However, LLMs, especially when generating structured data like JSON for tool inputs, can sometimes produce values that are unexpected, malformed, or even potentially unsafe. Therefore, rigorously validating and sanitizing any input your tool receives before acting upon it is a fundamental aspect of building reliable and secure agentic systems. This isn't just about preventing crashes; it's about ensuring predictable behavior and protecting your systems.
Think of input validation and sanitization as two distinct but complementary security checkpoints for data entering your tool.
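To make the distinction concrete, here is a minimal sketch (the helper names are illustrative, not from any particular framework): validation rejects input that fails a rule, while sanitization transforms input into a safer, normalized form.

def validate_username(username: str) -> str:
    # Validation: reject input that does not meet the rules
    if not username.isalnum() or not (3 <= len(username) <= 20):
        raise ValueError("Username must be 3-20 alphanumeric characters.")
    return username

def sanitize_search_text(text: str) -> str:
    # Sanitization: transform input into a safer, normalized form
    return " ".join(text.split())  # trim and collapse whitespace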
The following diagram illustrates where validation and sanitization fit into the tool invocation flow:
Diagram: Flow of input processing within an LLM tool, highlighting validation and sanitization steps.
Effective input validation acts as the first line of defense. If the input doesn't meet the basic requirements, there's often no point in proceeding to sanitization or execution.
At the most fundamental level, Python's built-in capabilities can help. Use isinstance() to ensure an input is of the expected Python type (e.g., str, int, list), check that numeric values fall within allowed ranges, and enforce sensible length limits on strings:
def process_item_id(item_id: int):
    # Type check: reject anything that is not an integer
    if not isinstance(item_id, int):
        raise TypeError("Item ID must be an integer.")
    # ... further processing

def set_brightness(level: int):
    # Range check: only accept values within the supported bounds
    if not (0 <= level <= 100):
        raise ValueError("Brightness level must be between 0 and 100.")
    # ... set brightness

def submit_comment(text: str):
    # Length check: guard against empty or excessively long input
    if not (10 <= len(text) <= 1000):
        raise ValueError("Comment must be between 10 and 1000 characters.")
    # ... submit comment
For inputs that must follow a specific format (e.g., email addresses, specific ID patterns, dates), regular expressions are a powerful tool. Python's re module provides the necessary functions.
import re

def validate_product_code(code: str):
    pattern = r"^[A-Z]{3}-\d{5}$"  # e.g., ABC-12345
    if not re.match(pattern, code):
        raise ValueError("Product code format is invalid. Expected format: XXX-NNNNN")
    return True
While flexible, complex regular expressions can become difficult to maintain. Use them judiciously for clear, well-defined patterns.
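One way to keep a non-trivial pattern maintainable is Python's re.VERBOSE flag, which lets you annotate the pattern inline; a brief sketch reusing the product-code pattern above:

import re

# re.VERBOSE ignores insignificant whitespace and allows comments inside the pattern
PRODUCT_CODE = re.compile(
    r"""
    ^[A-Z]{3}   # three-letter prefix, e.g. ABC
    -           # literal separator
    \d{5}$      # five-digit item number, e.g. 12345
    """,
    re.VERBOSE,
)

print(bool(PRODUCT_CODE.match("ABC-12345")))  # True
print(bool(PRODUCT_CODE.match("abc-123")))    # False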
For tools that expect structured inputs, especially JSON-like objects (which are common for LLM tool arguments), libraries like Pydantic can significantly simplify validation. Pydantic uses Python type hints to define data schemas and performs validation, data parsing, and error reporting.
Consider a tool that requires a search query and an optional maximum number of results:
import re
from pydantic import BaseModel, Field, validator, ValidationError
from typing import Optional

class SearchToolInput(BaseModel):
    query: str = Field(..., min_length=3, max_length=200, description="The search query string.")
    max_results: Optional[int] = Field(default=10, gt=0, le=50, description="Maximum number of results to return.")

    class Config:
        extra = "forbid"  # reject unexpected fields instead of silently ignoring them

    # Pydantic v1-style validator; in Pydantic v2 use @field_validator instead
    @validator('query')
    def query_must_be_alphanumeric_or_spaces(cls, v):
        # A custom validator to allow only alphanumeric characters and spaces
        if not re.match(r"^[a-zA-Z0-9\s]+$", v):
            raise ValueError('Query can only contain alphanumeric characters and spaces.')
        return v.strip()  # Also a light form of sanitization: trim whitespace

# Example usage within your tool
def search_documents(raw_input: dict):
    try:
        validated_input = SearchToolInput(**raw_input)
        # Now use validated_input.query and validated_input.max_results
        print(f"Searching for: '{validated_input.query}' with max_results: {validated_input.max_results}")
        # ... actual search logic ...
        return {"status": "success", "results": []}  # Placeholder
    except ValidationError as e:
        # Pydantic's error messages are quite informative
        return {"status": "error", "message": f"Input validation failed: {e.errors()}"}

# Simulating LLM input
llm_provided_input_valid = {"query": " Python best practices ", "max_results": 5}
llm_provided_input_invalid_query = {"query": "Python!", "max_results": 5}  # Contains '!'
llm_provided_input_invalid_max_results = {"query": "Data Science", "max_results": 100}  # > 50

print(search_documents(llm_provided_input_valid))
print(search_documents(llm_provided_input_invalid_query))
print(search_documents(llm_provided_input_invalid_max_results))
Using Pydantic offers several advantages: the expected input shape is declared in one place using standard type hints, type and constraint checks (lengths, numeric bounds) run automatically when the model is instantiated, custom validators add domain-specific rules and light normalization such as the strip() above, and ValidationError provides structured, informative error details you can relay back to the LLM.
Sanitization is about transforming input to ensure it's safe, especially when that input might be used in sensitive contexts like database queries, shell commands, or HTML output. The core principle is to treat all input from external sources (including an LLM) as untrusted.
Your tools should operate with the minimum necessary permissions. If a tool writes to a file, ensure it can only write to intended, safe locations. If it queries a database, use a database user with restricted permissions (e.g., read-only access to specific tables if that's all it needs).
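As a concrete illustration of least privilege, restrictions can sometimes be enforced at the connection level rather than in application code. Below is a minimal sketch using SQLite's read-only URI mode; the app_example.db file name is just an illustration.

import sqlite3

# Create a small example database (in a real system this already exists)
setup = sqlite3.connect("app_example.db")
setup.execute("CREATE TABLE IF NOT EXISTS users (username TEXT)")
setup.commit()
setup.close()

# Open it read-only so the tool physically cannot modify data,
# even if a crafted input slips past validation
conn = sqlite3.connect("file:app_example.db?mode=ro", uri=True)
try:
    conn.execute("DELETE FROM users")  # rejected by the read-only connection
except sqlite3.OperationalError as e:
    print(f"Write rejected: {e}")
conn.close()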
SQL Injection: If your tool constructs SQL queries using input data, it's highly vulnerable.
# BAD: Vulnerable to SQL injection
# cursor.execute(f"SELECT * FROM users WHERE username = '{username}'")

# GOOD: Use a parameterized query (example with sqlite3)
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (username TEXT)")

username = "admin"  # Potentially from an LLM
# Placeholders ensure the value is treated as data, never as SQL
cursor.execute("SELECT * FROM users WHERE username = ?", (username,))
print(cursor.fetchall())
Command Injection: If your tool needs to execute shell commands, directly inserting user input into command strings is extremely dangerous. Prefer subprocess.run() with the command and arguments passed as a list, not a single string. Modules like shlex offer functions (e.g., shlex.quote()) to escape characters in strings intended for shell use, but this should be a secondary defense to careful command construction.

import subprocess
import shlex
import re

def list_directory_contents(user_path: str):
    # Validate the path: ideally ensure it stays within an allowed base directory.
    # Here we use a character allow-list and block '..' to prevent path traversal.
    # Also avoid values like "-la" that ls would interpret as options;
    # a strict allow-list of permitted path components is even better.
    if not re.match(r"^[a-zA-Z0-9_/\.-]+$", user_path) or ".." in user_path:
        return {"status": "error", "message": "Invalid path characters or path traversal attempt."}

    # Less safe: building a shell command string, even with shlex.quote()
    # safe_path_arg = shlex.quote(user_path)
    # command_string = f"ls -l {safe_path_arg}"
    # result = subprocess.run(command_string, shell=True, capture_output=True, text=True)

    # Better: pass arguments as a list so no shell is involved at all
    try:
        result = subprocess.run(["ls", "-l", user_path], capture_output=True, text=True, check=True)
        return {"status": "success", "output": result.stdout}
    except subprocess.CalledProcessError as e:
        return {"status": "error", "message": f"Command failed: {e.stderr}"}
    except FileNotFoundError:
        return {"status": "error", "message": "ls command not found or invalid path."}

# print(list_directory_contents("."))  # Safe
# print(list_directory_contents("nonexistent_dir; rm -rf /"))  # Rejected by the validation above;
# even without it, list-based arguments prevent the shell from ever seeing "; rm -rf /"
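If you genuinely must hand a command line to a shell (for example, running a command over SSH), shlex.quote() at least turns the value into a single, literal argument. A brief sketch:

import shlex

user_path = "nonexistent_dir; rm -rf /"
print(shlex.quote(user_path))             # 'nonexistent_dir; rm -rf /'  (one quoted, literal argument)
print(f"ls -l {shlex.quote(user_path)}")  # ls -l 'nonexistent_dir; rm -rf /'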
HTML/Script Injection (Cross-Site Scripting - XSS): If your tool's output might be rendered in a web browser, and that output includes LLM-generated input, you must sanitize it to prevent XSS.
Libraries like bleach are designed to clean HTML, allowing only a safe subset of tags and attributes.

import bleach

def format_comment_for_html(user_comment: str):
    # Allow only bold and italic tags, and strip all other HTML
    allowed_tags = ['b', 'i']
    sanitized_comment = bleach.linkify(bleach.clean(user_comment, tags=allowed_tags, strip=True))
    return f"<p>{sanitized_comment}</p>"

# unsafe_comment = "This is <b>great</b>! <script>alert('XSS')</script>"
# print(format_comment_for_html(unsafe_comment))
# Output: <p>This is <b>great</b>! alert('XSS')</p>  (the disallowed script tags are stripped)
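If the output context should contain no HTML at all, you can skip tag allow-listing entirely and escape everything with the standard library's html.escape; a minimal sketch:

import html

def render_plain_comment(user_comment: str) -> str:
    # Escape <, >, &, and quotes so the browser renders the text literally
    return f"<p>{html.escape(user_comment)}</p>"

print(render_plain_comment("This is <b>great</b>! <script>alert('XSS')</script>"))
# <p>This is &lt;b&gt;great&lt;/b&gt;! &lt;script&gt;alert(&#x27;XSS&#x27;)&lt;/script&gt;</p>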
When possible, define what is allowed (an allow-list) rather than trying to list everything that is not allowed (a deny-list). Deny-lists are notoriously difficult to get right and are often circumvented as new attack vectors are discovered. For example, for a parameter that expects a country code, validate against a known list of valid country codes.
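For example, a country-code parameter can be checked against an explicit allow-list; a minimal sketch (the set below is an illustrative subset, not the full ISO 3166 list):

ALLOWED_COUNTRY_CODES = {"US", "GB", "DE", "FR", "JP"}  # illustrative subset only

def validate_country_code(code: str) -> str:
    normalized = code.strip().upper()
    if normalized not in ALLOWED_COUNTRY_CODES:
        raise ValueError(f"Unsupported country code: {code!r}")
    return normalized

print(validate_country_code("us"))  # 'US'
# validate_country_code("XX")       # raises ValueError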
When validation or sanitization rules are violated, your tool shouldn't just crash or proceed with tainted data. It needs to reject the input, stop before executing any further logic with it, and return a clear, structured error message that the LLM (or a developer monitoring the agent) can use to correct the request.
For instance, if Pydantic validation fails, its ValidationError contains structured information about the errors. You can format this into a string that the LLM can parse or understand.
# (Continuing the Pydantic SearchToolInput example)
# import json
# ...
# except ValidationError as e:
#     error_details = e.errors()  # This is a list of dicts
#     # Construct a user-friendly message for the LLM
#     messages = []
#     for error in error_details:
#         field = " -> ".join(map(str, error['loc']))  # loc can be a path for nested models
#         msg = error['msg']
#         messages.append(f"Field '{field}': {msg}")
#     error_summary = "Input validation failed. " + "; ".join(messages)
#     return {"status": "error", "message": error_summary, "details": error_details}

# llm_provided_input_very_wrong = {"query": "Q", "max_results": 200, "extra_field": "test"}
# result = search_documents(llm_provided_input_very_wrong)
# print(json.dumps(result, indent=2))

# Expected output might look like:
# {
#   "status": "error",
#   "message": "Input validation failed. Field 'query': ensure this value has at least 3 characters; Field 'max_results': ensure this value is less than or equal to 50; Field 'extra_field': extra fields not permitted",
#   "details": [
#     { /* Pydantic error details */ }
#   ]
# }
The goal is to provide enough information for the LLM (or a developer monitoring the agent) to understand the input requirements better.
By diligently applying input validation and sanitization, you build a foundation of trust and reliability for your Python tools. This not only prevents errors and security vulnerabilities but also contributes to a more predictable and effective interaction between the LLM agent and its extended capabilities. Remember that any input originating from outside your direct control, including that generated by an LLM, requires careful scrutiny.