As discussed in the chapter introduction, the interaction boundary between your LangChain application and the external world, primarily through user input, is a significant area for security focus. LLM applications often process unstructured natural language, which inherently lacks the strict schemas of traditional software inputs. This flexibility, while powerful, opens avenues for malicious actors to manipulate the application's behavior. Input validation and sanitization are therefore foundational practices to mitigate these risks before potentially harmful data reaches your LLMs, tools, or backend systems.
Unlike traditional applications where validation often focuses on data types and formats, validation in LLM contexts must also consider the semantic content and its potential to influence the language model or downstream components in unintended ways.
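For example, a lightweight heuristic can flag instruction-override phrasing before the input ever reaches the model. This is only a sketch: the pattern list below is illustrative and easily rephrased around, so treat a match as one signal among several rather than a definitive gate.

```python
import re

# Illustrative patterns only; attackers can rephrase, so use this as a signal, not a gate.
SUSPICIOUS_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"disregard (the )?system prompt",
    r"you are now",
]

def looks_like_injection(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS)
```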
Failing to properly validate and sanitize inputs exposes LLM applications to vulnerabilities such as prompt injection, unintended tool or command execution, and leakage of sensitive data.
Effective input validation requires identifying every point where external data enters your system and applying checks as early as possible. Common entry points in a LangChain application include direct user messages, arguments passed to tools, documents retrieved for augmentation, and responses returned by external APIs.
The following diagram illustrates typical points for input validation within a request flow involving an agent:
This flow shows validation applied immediately after receiving user input and again specifically before executing a tool with potentially unsafe arguments derived from user input or LLM output.
Choose techniques appropriate for the specific input and the potential risks. A combination is often necessary:
- Type and format checks: Confirm that inputs match the expected types (str, int, list, bool). This is a basic but fundamental check, especially for tool arguments.
- Length limits: Reject empty or excessively long inputs before they reach the model; this bounds resource usage and leaves less room for injected instructions. For example:

```python
MAX_INPUT_LENGTH = 1024

def validate_length(user_input: str) -> str:
    if not (0 < len(user_input) <= MAX_INPUT_LENGTH):
        raise ValueError(f"Input length must be between 1 and {MAX_INPUT_LENGTH} characters.")
    return user_input
```
- Schema validation (Pydantic): Define structured models for complex inputs, particularly tool arguments supplied through a tool's args_schema. Pydantic enforces types and constraints and lets you attach custom safety checks:
```python
from pydantic import BaseModel, Field, validator
import re

class SearchToolSchema(BaseModel):
    query: str = Field(..., description="Search query", min_length=3, max_length=150)
    max_results: int = Field(default=5, gt=0, le=20)

    @validator('query')
    def query_safety_check(cls, v):
        # Example: Prevent obvious command-like structures (adjust regex as needed)
        if re.search(r'[;&|`$()]', v):
            raise ValueError("Query contains potentially unsafe characters.")
        # Example: Block specific keywords (use with caution, can be bypassed)
        if "DROP TABLE" in v.upper():
            raise ValueError("Potentially harmful SQL keyword detected.")
        return v.strip()  # Also perform basic sanitization
```
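Constructing the schema with unsafe input then fails fast. A quick usage sketch, reusing the SearchToolSchema above with an arbitrary query:

```python
from pydantic import ValidationError

try:
    SearchToolSchema(query="laptops; rm -rf /", max_results=5)
except ValidationError as err:
    print(err)  # reports that the query contains potentially unsafe characters
```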
- Block lists: Reject inputs containing known dangerous keywords or patterns (for example password, admin, or script tags such as <script>). Block lists are notoriously difficult to maintain and easy for attackers to bypass with encoding or obfuscation techniques, so use them sparingly as a secondary defense layer.
- Escaping and encoding: Convert characters with special meaning in the target context into safe representations (for example, > becomes &gt; in HTML). This is essential whenever input might be rendered in another context such as a web page or a SQL query, and you should always prefer parameterized queries for SQL over manual escaping; see the sketch after this list.
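The following sketch illustrates both points using only the standard library; the users table and username column are hypothetical. html.escape handles text destined for HTML, and the sqlite3 query uses a placeholder instead of string formatting.

```python
import html
import sqlite3

def render_safe_html(user_input: str) -> str:
    # Escape <, >, &, and quotes so user text cannot inject markup.
    return f"<p>{html.escape(user_input)}</p>"

def find_user(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver handles quoting, so the input
    # cannot alter the structure of the SQL statement itself.
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),
    )
    return cursor.fetchone()
```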
You can integrate these techniques into your LangChain applications in several ways:

- Within tools: Add validation logic directly in the _run or _arun methods of your custom tools, or use a Pydantic args_schema for automatic validation upon tool invocation.
```python
from typing import Type

from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field

# Using a Pydantic schema for validation
class CalculatorInput(BaseModel):
    expression: str = Field(..., description="Mathematical expression to evaluate")
    # Add more specific validators here if needed

class SafeCalculatorTool(BaseTool):
    name: str = "safe_calculator"
    description: str = "Safely evaluates a mathematical expression."
    args_schema: Type[BaseModel] = CalculatorInput

    def _run(self, expression: str) -> str:
        # Further validation/sanitization specific to evaluation can happen here.
        # Use a safe evaluation library instead of eval().
        try:
            # numexpr only evaluates numeric expressions, unlike eval()
            import numexpr
            # Sanitize further if needed, e.g., limit allowed functions/variables
            result = numexpr.evaluate(expression).item()
            return str(result)
        except Exception as e:
            return f"Error evaluating expression: {e}"

    # Implement _arun for async support if needed
```
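Because LangChain tools are Runnables, you can exercise both the schema validation and the safe evaluation path directly. A short usage sketch, assuming the SafeCalculatorTool above and an installed numexpr package:

```python
calculator = SafeCalculatorTool()

# Valid structured input passes schema validation and is evaluated safely.
print(calculator.invoke({"expression": "2 * (3 + 4)"}))  # 14

# A missing required argument is rejected by the args_schema before _run runs.
try:
    calculator.invoke({})
except Exception as exc:
    print(f"Rejected by schema validation: {exc}")
```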
- RunnableLambda in LCEL: Create custom validation or sanitization functions and wrap them as RunnableLambda components to insert them into your LangChain Expression Language (LCEL) chains.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

def validate_and_sanitize(input_data):
    # Assuming input_data is a dictionary containing user_query
    query = input_data.get("user_query", "")
    if len(query) > 500:
        raise ValueError("Query exceeds maximum length of 500 characters.")
    # Basic sanitization
    sanitized_query = query.strip()
    # Add more checks here...
    input_data["user_query"] = sanitized_query
    return input_data  # Pass through the potentially modified data

validation_step = RunnableLambda(validate_and_sanitize)

llm = ChatOpenAI(model="gpt-3.5-turbo")  # Replace with your LLM
prompt = ChatPromptTemplate.from_template("Answer the user's question: {user_query}")

# Chain incorporating validation early
chain = (
    RunnablePassthrough()  # Start with input dictionary
    | validation_step
    | prompt
    | llm
    # | output_parser ...
)

# Example invocation
try:
    # result = chain.invoke({"user_query": " Tell me about LangChain. "})
    # result_long = chain.invoke({"user_query": "A" * 1000})  # This would raise ValueError
    pass  # Placeholder for actual invocation
except ValueError as e:
    print(f"Validation failed: {e}")
```
- Custom Runnables: For more involved or reusable validation logic, implement it in your own Runnable class, as sketched below.
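A minimal sketch of that approach, assuming chains that pass a dictionary with a user_query key; the class name and the max_length parameter are illustrative:

```python
from typing import Optional

from langchain_core.runnables import Runnable, RunnableConfig


class InputValidationRunnable(Runnable[dict, dict]):
    """Reusable validation step for chains that pass a dict of inputs."""

    def __init__(self, max_length: int = 500):
        self.max_length = max_length

    def invoke(self, input: dict, config: Optional[RunnableConfig] = None, **kwargs) -> dict:
        query = input.get("user_query", "")
        if not query.strip():
            raise ValueError("user_query must not be empty.")
        if len(query) > self.max_length:
            raise ValueError(f"user_query exceeds {self.max_length} characters.")
        # Return a sanitized copy rather than mutating the caller's dict.
        return {**input, "user_query": query.strip()}


# Composes with | like any other Runnable:
# chain = InputValidationRunnable(max_length=500) | prompt | llm
```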
Input validation and sanitization are not silver bullets, but they form a critical layer in a defense-in-depth security strategy for LangChain applications. By carefully considering potential threats and applying appropriate checks at key integration points, you can significantly reduce the attack surface of your LLM-powered systems. Remember to tailor your validation rules to the specific context and sensitivity of the data and actions involved.