As discussed in the chapter introduction, the interaction boundary between your LangChain application and the external world, primarily through user input, is a significant area for security focus. LLM applications often process unstructured natural language, which inherently lacks the strict schemas of traditional software inputs. This flexibility, while powerful, opens avenues for malicious actors to manipulate the application's behavior. Input validation and sanitization are therefore foundational practices to mitigate these risks before potentially harmful data reaches your LLMs, tools, or backend systems.
Unlike traditional applications where validation often focuses on data types and formats, validation in LLM contexts must also consider the semantic content and its potential to influence the language model or downstream components in unintended ways.
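For example, a lightweight heuristic can flag instruction-override phrasing before the input ever reaches the model. This is only a sketch: the pattern list below is illustrative and easily rephrased around, so treat a match as one signal among several rather than a definitive gate.

```python
import re

# Illustrative patterns only; attackers can rephrase, so use this as a signal, not a gate.
SUSPICIOUS_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"disregard (the )?system prompt",
    r"you are now",
]

def looks_like_injection(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS)
```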
Failing to properly validate and sanitize inputs exposes LLM applications to vulnerabilities such as prompt injection, unintended tool or command execution, and leakage of sensitive data.
Effective input validation requires identifying every point where external data enters your system and applying checks as early as possible. Common entry points in a LangChain application include direct user messages, arguments passed to tools, documents retrieved for augmentation, and responses returned by external APIs.
The following diagram illustrates typical points for input validation within a request flow involving an agent:
This flow shows validation applied immediately after receiving user input and again specifically before executing a tool with potentially unsafe arguments derived from user input or LLM output.
Choose techniques appropriate for the specific input and the potential risks. A combination is often necessary:
- Type and format checks: Confirm that inputs match the expected types (str, int, list, bool). This is a basic but fundamental check, especially for tool arguments.
- Length limits: Reject empty or excessively long inputs before they reach the model; this bounds resource usage and leaves less room for injected instructions. For example:

```python
MAX_INPUT_LENGTH = 1024

def validate_length(user_input: str) -> str:
    if not (0 < len(user_input) <= MAX_INPUT_LENGTH):
        raise ValueError(f"Input length must be between 1 and {MAX_INPUT_LENGTH} characters.")
    return user_input
```
- Schema validation (Pydantic): Define structured models for complex inputs, particularly tool arguments supplied through a tool's args_schema. Pydantic enforces types and constraints and lets you attach custom safety checks:
```python
from pydantic import BaseModel, Field, validator
import re

class SearchToolSchema(BaseModel):
    query: str = Field(..., description="Search query", min_length=3, max_length=150)
    max_results: int = Field(default=5, gt=0, le=20)

    @validator('query')
    def query_safety_check(cls, v):
        # Example: Prevent obvious command-like structures (adjust regex as needed)
        if re.search(r'[;&|`$()]', v):
            raise ValueError("Query contains potentially unsafe characters.")
        # Example: Block specific keywords (use with caution, can be bypassed)
        if "DROP TABLE" in v.upper():
            raise ValueError("Potentially harmful SQL keyword detected.")
        return v.strip()  # Also perform basic sanitization
```
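Constructing the schema with unsafe input then fails fast. A quick usage sketch, reusing the SearchToolSchema above with an arbitrary query:

```python
from pydantic import ValidationError

try:
    SearchToolSchema(query="laptops; rm -rf /", max_results=5)
except ValidationError as err:
    print(err)  # reports that the query contains potentially unsafe characters
```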
- Block lists: Reject inputs containing known dangerous keywords or patterns (for example password, admin, or script tags such as <script>). Block lists are notoriously difficult to maintain and easy for attackers to bypass with encoding or obfuscation techniques, so use them sparingly as a secondary defense layer.
- Escaping and encoding: Convert characters with special meaning in the target context into safe representations (for example, > becomes &gt; in HTML). This is essential whenever input might be rendered in another context such as a web page or a SQL query, and you should always prefer parameterized queries for SQL over manual escaping; see the sketch after this list.
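The following sketch illustrates both points using only the standard library; the users table and username column are hypothetical. html.escape handles text destined for HTML, and the sqlite3 query uses a placeholder instead of string formatting.

```python
import html
import sqlite3

def render_safe_html(user_input: str) -> str:
    # Escape <, >, &, and quotes so user text cannot inject markup.
    return f"<p>{html.escape(user_input)}</p>"

def find_user(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver handles quoting, so the input
    # cannot alter the structure of the SQL statement itself.
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?",
        (username,),
    )
    return cursor.fetchone()
```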
You can integrate these techniques into your LangChain applications in several ways:

- Within tools: Add validation logic directly in the _run or _arun methods of your custom tools, or use a Pydantic args_schema for automatic validation upon tool invocation.
```python
from typing import Type

from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field

# Using a Pydantic schema for validation
class CalculatorInput(BaseModel):
    expression: str = Field(..., description="Mathematical expression to evaluate")
    # Add more specific validators here if needed

class SafeCalculatorTool(BaseTool):
    name: str = "safe_calculator"
    description: str = "Safely evaluates a mathematical expression."
    args_schema: Type[BaseModel] = CalculatorInput

    def _run(self, expression: str) -> str:
        # Further validation/sanitization specific to evaluation can happen here.
        # Use a safe evaluation library instead of eval().
        try:
            # numexpr only evaluates numeric expressions, unlike eval()
            import numexpr
            # Sanitize further if needed, e.g., limit allowed functions/variables
            result = numexpr.evaluate(expression).item()
            return str(result)
        except Exception as e:
            return f"Error evaluating expression: {e}"

    # Implement _arun for async support if needed
```
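Because LangChain tools are Runnables, you can exercise both the schema validation and the safe evaluation path directly. A short usage sketch, assuming the SafeCalculatorTool above and an installed numexpr package:

```python
calculator = SafeCalculatorTool()

# Valid structured input passes schema validation and is evaluated safely.
print(calculator.invoke({"expression": "2 * (3 + 4)"}))  # 14

# A missing required argument is rejected by the args_schema before _run runs.
try:
    calculator.invoke({})
except Exception as exc:
    print(f"Rejected by schema validation: {exc}")
```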
- RunnableLambda in LCEL: Create custom validation or sanitization functions and wrap them as RunnableLambda components to insert them into your LangChain Expression Language (LCEL) chains.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

def validate_and_sanitize(input_data):
    # Assuming input_data is a dictionary containing user_query
    query = input_data.get("user_query", "")
    if len(query) > 500:
        raise ValueError("Query exceeds maximum length of 500 characters.")
    # Basic sanitization
    sanitized_query = query.strip()
    # Add more checks here...
    input_data["user_query"] = sanitized_query
    return input_data  # Pass through the potentially modified data

validation_step = RunnableLambda(validate_and_sanitize)

llm = ChatOpenAI(model="gpt-3.5-turbo")  # Replace with your LLM
prompt = ChatPromptTemplate.from_template("Answer the user's question: {user_query}")

# Chain incorporating validation early
chain = (
    RunnablePassthrough()  # Start with input dictionary
    | validation_step
    | prompt
    | llm
    # | output_parser ...
)

# Example invocation
try:
    # result = chain.invoke({"user_query": " Tell me about LangChain. "})
    # result_long = chain.invoke({"user_query": "A" * 1000})  # This would raise ValueError
    pass  # Placeholder for actual invocation
except ValueError as e:
    print(f"Validation failed: {e}")
```
- Custom Runnables: For more involved or reusable validation logic, implement it in your own Runnable class, as sketched below.
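A minimal sketch of that approach, assuming chains that pass a dictionary with a user_query key; the class name and the max_length parameter are illustrative:

```python
from typing import Optional

from langchain_core.runnables import Runnable, RunnableConfig


class InputValidationRunnable(Runnable[dict, dict]):
    """Reusable validation step for chains that pass a dict of inputs."""

    def __init__(self, max_length: int = 500):
        self.max_length = max_length

    def invoke(self, input: dict, config: Optional[RunnableConfig] = None, **kwargs) -> dict:
        query = input.get("user_query", "")
        if not query.strip():
            raise ValueError("user_query must not be empty.")
        if len(query) > self.max_length:
            raise ValueError(f"user_query exceeds {self.max_length} characters.")
        # Return a sanitized copy rather than mutating the caller's dict.
        return {**input, "user_query": query.strip()}


# Composes with | like any other Runnable:
# chain = InputValidationRunnable(max_length=500) | prompt | llm
```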
Input validation and sanitization are not silver bullets, but they form a critical layer in a defense-in-depth security strategy for LangChain applications. By carefully considering potential threats and applying appropriate checks at key integration points, you can significantly reduce the attack surface of your LLM-powered systems. Remember to tailor your validation rules to the specific context and sensitivity of the data and actions involved.