As you build an arsenal of tools for your LLM agents, ensuring each tool is reliable and performs as expected becomes essential. Faulty tools can lead to unpredictable agent behavior, incorrect outputs, and a frustrating user experience. This section focuses on establishing robust testing practices, specifically unit and integration testing, tailored for the unique context of LLM agent tools. By rigorously testing your tools, you lay the foundation for dependable and effective agent systems.
Testing tools designed for LLM agents presents a slightly different set of challenges compared to traditional software components. While the core principles of unit and integration testing remain, the interaction with an LLM adds extra layers of consideration, such as how the tool's outputs and errors will be interpreted by the model.
The primary goal of unit and integration testing for agent tools is to verify that each tool functions correctly in isolation and integrates seamlessly with the agent's operational environment.
Unit tests focus on the smallest testable parts of your tool, typically individual functions or methods within a class. The objective is to verify that each piece of the tool's logic works correctly, independent of other parts of the system or external dependencies.
Main Aspects to Cover in Unit Tests:
- Core Functionality: Verify the tool's primary logic. For example, a calculate_mortgage_payment tool should return the correct payment amount for given loan parameters.
- Input Validation and Handling: Confirm that malformed or out-of-range inputs are rejected or handled gracefully rather than raising unhandled exceptions.
- Output Formatting and Structure: Check that results come back in the structure the agent expects, such as the exact dictionary keys.
- Error Handling and Reporting: Ensure failures produce clear, structured error messages the agent can act on.
- State Management (for Stateful Tools): Verify that internal state is initialized, updated, and reset correctly across calls; a sketch of this pattern follows the list.
- Mocking External Dependencies: Isolate the tool's logic from external services such as APIs or databases. Python's unittest.mock library (with MagicMock and patch) is commonly used for this.
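To make the state-management point concrete, tests for a stateful tool should exercise sequences of calls rather than a single invocation. The sketch below uses a small hypothetical ShoppingListTool defined inline (it is not part of the stock example that follows) purely to illustrate the pattern.

# test_shopping_list_tool.py
# A minimal sketch for testing a stateful tool. ShoppingListTool is a
# hypothetical example class, defined inline to keep the test self-contained.

class ShoppingListTool:
    """Toy stateful tool: keeps an in-memory list across calls."""
    def __init__(self):
        self.items = []

    def add_item(self, item: str) -> dict:
        self.items.append(item)
        return {"status": "ok", "count": len(self.items)}

    def clear(self) -> dict:
        self.items = []
        return {"status": "ok", "count": 0}

def test_state_accumulates_across_calls():
    tool = ShoppingListTool()
    tool.add_item("milk")
    result = tool.add_item("eggs")
    # State from the first call should still be visible in the second.
    assert result["count"] == 2

def test_clear_resets_state():
    tool = ShoppingListTool()
    tool.add_item("milk")
    assert tool.clear()["count"] == 0
    # A fresh call after clearing starts from an empty list again.
    assert tool.add_item("bread")["count"] == 1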
Example: Unit Testing a Simple Python Tool
Consider a Python tool that fetches a stock price:
# stock_tool.py
import requests

class StockPriceTool:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.examplefinance.com/v1/stock"

    def get_price(self, ticker_symbol: str) -> dict:
        if not isinstance(ticker_symbol, str) or not ticker_symbol.isalpha():
            return {"error": "Invalid ticker symbol format."}
        try:
            response = requests.get(
                f"{self.base_url}/{ticker_symbol}/price",
                params={"apikey": self.api_key}
            )
            response.raise_for_status()  # Raises HTTPError for bad responses (4XX or 5XX)
            data = response.json()
            if "price" not in data:
                return {"error": f"Price data not found for {ticker_symbol}."}
            return {"ticker": ticker_symbol, "price": data["price"]}
        except requests.exceptions.RequestException as e:
            return {"error": f"API request failed: {str(e)}"}
        except ValueError:  # JSONDecodeError inherits from ValueError
            return {"error": "Failed to parse API response."}
A unit test using pytest and unittest.mock might look like this:
# test_stock_tool.py
import pytest
import requests  # needed for requests.exceptions and for asserting on the patched requests.get
from unittest.mock import MagicMock

from stock_tool import StockPriceTool

@pytest.fixture
def tool():
    return StockPriceTool(api_key="test_api_key")

def test_get_price_success(tool, monkeypatch):
    mock_response = MagicMock()
    mock_response.json.return_value = {"price": 150.75}
    mock_response.raise_for_status.return_value = None  # Simulate no HTTP error

    # Use monkeypatch from pytest for requests.get, similar to unittest.mock.patch
    monkeypatch.setattr("requests.get", MagicMock(return_value=mock_response))

    result = tool.get_price("AAPL")

    assert result == {"ticker": "AAPL", "price": 150.75}
    requests.get.assert_called_once_with(
        "https://api.examplefinance.com/v1/stock/AAPL/price",
        params={"apikey": "test_api_key"}
    )

def test_get_price_invalid_ticker_format(tool):
    result = tool.get_price("AAPL123")
    assert result == {"error": "Invalid ticker symbol format."}

def test_get_price_api_failure(tool, monkeypatch):
    mock_response = MagicMock()
    mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError("API Unavailable")
    monkeypatch.setattr("requests.get", MagicMock(return_value=mock_response))

    result = tool.get_price("MSFT")
    assert "API request failed" in result["error"]

def test_get_price_data_not_found(tool, monkeypatch):
    mock_response = MagicMock()
    mock_response.json.return_value = {"message": "Ticker not found"}  # API returns no price field
    mock_response.raise_for_status.return_value = None
    monkeypatch.setattr("requests.get", MagicMock(return_value=mock_response))

    result = tool.get_price("GOOG")
    assert result == {"error": "Price data not found for GOOG."}
Best Practices for Unit Tests:
Among other practices, give each test a descriptive name that states the behavior it verifies (for example, test_tool_handles_invalid_input_gracefully), keep each test focused on a single behavior, and mock external dependencies so tests run quickly and deterministically.

A unit test verifies an individual agent tool's method or function in isolation, often using mocks for external dependencies like API calls.
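A convenient way to keep tests small and descriptive is to parametrize related cases. The minimal sketch below reuses the StockPriceTool from above; the specific invalid inputs are only illustrative.

# test_stock_tool_validation.py
# A minimal sketch: parametrized tests for input validation,
# reusing the StockPriceTool defined earlier. No network call is
# needed because validation fails before the API is reached.
import pytest
from stock_tool import StockPriceTool

@pytest.mark.parametrize("bad_ticker", ["AAPL123", "", "BRK.B", 42])
def test_get_price_rejects_invalid_ticker(bad_ticker):
    tool = StockPriceTool(api_key="test_api_key")
    result = tool.get_price(bad_ticker)
    # Every invalid input should produce the structured validation error,
    # not an unhandled exception or a partial result.
    assert result == {"error": "Invalid ticker symbol format."}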
While unit tests confirm individual components work correctly, integration tests verify that your tools interact properly with the larger agent framework or orchestrator. They focus on the "plumbing" that connects your tool to the system that will eventually call it based on an LLM's decision.
Key Aspects to Cover in Integration Tests:
- Tool Registration and Discovery: Confirm the tool is registered with the framework under the expected name and that its description and parameter schema are exposed correctly (a sketch follows this list).
- Tool Invocation by the Framework: Verify the framework can actually call the tool and receive its return value.
- Parameter Mapping and Transformation: Check that arguments produced by the framework (or the LLM) are correctly mapped and converted to the tool's parameters.
- Response Handling by the Framework: Ensure the tool's output is parsed and routed back into the agent's flow as intended.
- Interaction with Real External Services (Staging/Test Environments): Where appropriate, exercise the tool against a test or staging instance of its external dependency rather than production.
- Simulated LLM Interaction: Trigger the tool through deterministic, scripted "LLM decisions" so tests do not depend on live model behavior.
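As a concrete, though hypothetical, illustration of the registration and discovery check: the sketch below assumes the framework exposes register_tool and get_registered_tools methods; substitute whatever registration API your framework actually provides.

# test_tool_registration.py
# A conceptual sketch of a registration/discovery check. The register_tool
# and get_registered_tools methods (and the returned structure) are
# assumptions about a hypothetical framework API.
from my_agent_framework import AgentFramework
from stock_tool import StockPriceTool

def test_stock_tool_is_registered_and_discoverable():
    agent = AgentFramework()
    agent.register_tool(
        name="getStockPrice",
        tool=StockPriceTool(api_key="test_api_key"),
        description="Returns the latest price for a stock ticker symbol.",
    )

    registered = agent.get_registered_tools()  # hypothetical discovery API

    # The tool should be discoverable under the name the LLM will see,
    # with a non-empty description it can use during tool selection.
    assert "getStockPrice" in registered
    assert registered["getStockPrice"].description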
Approaches to Integration Testing:
Common approaches include mocking the external services a tool depends on, pointing tools at staging or test endpoints, and scripting or mocking the LLM's tool-selection step so that specific tools are triggered deterministically.
Example: Integration Snippet (Conceptual)
Imagine you have an agent framework with a method run_agent_query(query: str). An integration test might look like this:
# test_agent_integration.py
import requests
from unittest.mock import MagicMock

from my_agent_framework import AgentFramework
from stock_tool import StockPriceTool  # Assuming it's registered

def test_stock_tool_integration_with_agent(monkeypatch):
    # Mock the external API call for the StockPriceTool, even in integration,
    # if you want to focus solely on framework-tool interaction and avoid
    # external flakiness. Or, let it hit a test/staging API endpoint.
    mock_api_response = MagicMock()
    mock_api_response.json.return_value = {"price": 200.00}
    mock_api_response.raise_for_status.return_value = None
    monkeypatch.setattr("requests.get", MagicMock(return_value=mock_api_response))

    agent = AgentFramework()

    # Assume StockPriceTool is registered with the agent under the name "getStockPrice"
    # and the agent framework can parse the query to call this tool.
    # This part is highly dependent on your specific agent framework.

    # This query is designed to deterministically trigger the stock tool
    # (either through specific keywords or a simplified NLU in a test setting).
    agent_response = agent.run_agent_query("What is the price of TSLA stock?")

    # Assertions check that the tool was called and its result processed.
    # This often involves inspecting logs or internal state of the agent/framework
    # if direct output isn't sufficient. For simplicity, assume agent_response
    # contains a structured result:
    assert "TSLA" in agent_response["tool_result"]["ticker"]
    assert agent_response["tool_result"]["price"] == 200.00

    # Potentially assert that requests.get was called with "TSLA"
    requests.get.assert_called_with(
        "https://api.examplefinance.com/v1/stock/TSLA/price",  # Or your actual ticker
        params={"apikey": "test_api_key"}  # Or however the API key is managed
    )
This example is simplified. Real integration tests with agent frameworks often involve more setup to configure the agent, its tools, and potentially mock the LLM's decision-making process to reliably trigger specific tools.
An integration test checks how a tool interacts with the agent framework or a simulated agent environment. This may involve real external service test instances or further mocking, depending on the test's focus.
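One way to make tool triggering deterministic is to stub the framework's tool-selection step instead of calling a live LLM. The sketch below assumes a hypothetical select_tool method on AgentFramework that normally asks the model which tool to run and with what arguments; the method name and its return shape are assumptions, so map them onto whatever hook your framework provides.

# test_agent_tool_selection_stub.py
# A sketch of making tool selection deterministic by stubbing the
# framework's LLM-facing decision step. select_tool and the shape of its
# return value are assumptions about a hypothetical framework hook.
import requests
from unittest.mock import MagicMock

from my_agent_framework import AgentFramework

def test_agent_routes_stubbed_decision_to_stock_tool(monkeypatch):
    # Stub the external API so the test stays offline, as in the earlier example.
    mock_api_response = MagicMock()
    mock_api_response.json.return_value = {"price": 200.00}
    mock_api_response.raise_for_status.return_value = None
    monkeypatch.setattr(requests, "get", MagicMock(return_value=mock_api_response))

    agent = AgentFramework()

    # Pretend the LLM chose the stock tool with ticker "TSLA"; the test then
    # exercises only the framework-to-tool plumbing, not model behavior.
    fake_decision = {"tool": "getStockPrice", "arguments": {"ticker_symbol": "TSLA"}}
    monkeypatch.setattr(agent, "select_tool", MagicMock(return_value=fake_decision))

    response = agent.run_agent_query("What is the price of TSLA stock?")

    agent.select_tool.assert_called_once()
    # The exact assertion depends on how your framework surfaces tool results.
    assert response["tool_result"]["price"] == 200.00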
It is important to reiterate that the unit and integration tests discussed here are primarily concerned with the correctness and reliability of the tool itself and its integration into the agent's plumbing. They ensure that if an LLM (or the agent framework) decides to call a tool with certain arguments, the tool behaves predictably and returns a well-formed response.
These tests generally do not evaluate whether the LLM chooses the right tool for a given query, whether it supplies sensible and correctly formatted arguments, or whether it interprets and uses the tool's output appropriately.
Evaluating these LLM-specific aspects falls under the broader umbrella of LLM evaluation and agent performance assessment, which are distinct, though related, disciplines. Well-tested tools are a prerequisite for meaningful LLM agent evaluation. If the tools themselves are unreliable, it's impossible to determine if a failure is due to the tool or the LLM's reasoning.
By implementing thorough unit and integration tests, you build a robust foundation of reliable tools. This not only improves the dependability of your LLM agents but also simplifies debugging and maintenance, as you can be more confident that individual tool components are functioning as intended. This disciplined approach to testing is a hallmark of sound engineering in the development of advanced LLM agent systems.