Theory provides the groundwork, but practical application solidifies understanding. This section walks you through the process of evaluating a LangChain agent using LangSmith, applying the concepts discussed earlier in this chapter. We will set up a simple agent, create an evaluation dataset, run the agent against the dataset, and implement custom evaluation logic to assess its performance programmatically.
This exercise assumes you have LangSmith set up and your API key configured in your environment (LANGCHAIN_API_KEY). You should also have basic familiarity with creating LangChain agents and using tools.
First, let's define a straightforward agent that uses a search tool. We'll use Tavily as our search tool for this example, but you could substitute another search tool or custom tools. Ensure you have the necessary packages installed (langchain, langchain-openai, langchain-community, tavily-python, langsmith). Also, set your Tavily API key (TAVILY_API_KEY) and OpenAI API key (OPENAI_API_KEY) as environment variables.
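Before moving on, it can help to confirm those variables are actually set. A minimal sanity check, assuming the variable names listed above:

import os

required_keys = ["OPENAI_API_KEY", "TAVILY_API_KEY", "LANGCHAIN_API_KEY"]
missing = [name for name in required_keys if not os.environ.get(name)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")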
import os
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain import hub
from langchain.agents import create_openai_functions_agent, AgentExecutor
from langsmith import Client
# Ensure API keys are set as environment variables
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# os.environ["TAVILY_API_KEY"] = "YOUR_TAVILY_API_KEY"
# os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
# os.environ["LANGCHAIN_TRACING_V2"] = "true" # Ensure tracing is enabled
# os.environ["LANGCHAIN_PROJECT"] = "Agent Evaluation Example" # Optional: Define a LangSmith project
# Initialize the LLM and Tool
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0)
search_tool = TavilySearchResults(max_results=2)
tools = [search_tool]
# Get the prompt template
# Using a standard OpenAI Functions Agent prompt
prompt = hub.pull("hwchase17/openai-functions-agent")
# Create the agent
# This agent is designed to work with models that support function calling
agent = create_openai_functions_agent(llm, tools, prompt)
# Create the AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# Test invocation (optional)
# print(agent_executor.invoke({"input": "What was the score of the last SF Giants game?"}))
We now have agent_executor, which represents the system we want to evaluate.
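Before evaluating, it is worth a quick smoke test to confirm the executor runs end to end. A minimal check (the question here is just an illustrative input; AgentExecutor returns a dictionary whose final answer is stored under the "output" key):

# Smoke test: invoke the agent once and print the final answer.
result = agent_executor.invoke({"input": "What is the capital of France?"})
print(result["output"])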
Evaluation requires a set of inputs and, ideally, expected outputs or criteria against which to judge the agent's performance. Let's create a small dataset directly in LangSmith using the client library. We'll include inputs (questions for our agent) and optional reference outputs.
# Initialize LangSmith client
client = Client()
dataset_name = "Simple Search Agent Questions V1"
dataset_description = "Basic questions requiring web search."
# Check if dataset exists, create if not
try:
    dataset = client.read_dataset(dataset_name=dataset_name)
    print(f"Dataset '{dataset_name}' already exists.")
except Exception:  # LangSmith client raises a generic Exception if not found
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description=dataset_description,
    )
    print(f"Created dataset '{dataset_name}'.")
# Define examples (input questions and optional reference outputs)
examples = [
    ("What is the capital of France?", "Paris"),
    ("Who won the 2023 Formula 1 World Championship?", "Max Verstappen"),
    ("What is the main component of air?", "Nitrogen"),
    ("Summarize the plot of the movie 'Inception'.", "A thief who steals information by entering people's dreams takes on the inverse task of planting an idea into a target's subconscious."),  # Example reference output
]
# Add examples to the dataset
for input_query, reference_output in examples:
    client.create_example(
        inputs={"input": input_query},
        outputs={"reference": reference_output},  # Using 'reference' key for the expected output
        dataset_id=dataset.id,
    )
print(f"Added {len(examples)} examples to dataset '{dataset_name}'.")
After running this code, you should see a new dataset named "Simple Search Agent Questions V1" in your LangSmith account, populated with the defined examples. The outputs dictionary in create_example can store reference values, labels, or any other information useful for evaluation.
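Because outputs is a free-form dictionary, you could attach more than a single reference string. A hypothetical variation that also stores a difficulty label alongside the reference (the extra question and the "difficulty" key are illustrative, not required by LangSmith):

# Hypothetical example: store an extra label alongside the reference answer.
client.create_example(
    inputs={"input": "Who wrote 'Pride and Prejudice'?"},
    outputs={
        "reference": "Jane Austen",  # expected answer, as before
        "difficulty": "easy",        # illustrative extra field an evaluator could read
    },
    dataset_id=dataset.id,
)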
With the agent defined and the dataset created, we can now run the agent over each example in the dataset using LangSmith's evaluation utilities. We'll start without a custom evaluator, primarily to collect traces and observe behavior.
from langsmith.evaluation import evaluate
# Define a function that encapsulates the agent invocation
# This is needed for the evaluate function
def agent_predictor(inputs: dict) -> dict:
    """Runs the agent executor for a given input dictionary."""
    return agent_executor.invoke({"input": inputs["input"]})  # Assumes dataset input key is "input"
# Run the evaluation
# This will execute the agent_predictor for each example in the dataset
# Results and traces will be automatically logged to LangSmith
evaluation_results = evaluate(
    agent_predictor,
    data=dataset_name,  # Can pass dataset name directly
    description="Initial evaluation run for the search agent.",
    project_name="Agent Eval Run - Simple Search",  # Optional: Logs to a specific project run
    # metadata={"agent_version": "1.0"},  # Optional: Add metadata to the run
)
print("Evaluation run completed. Check LangSmith for results.")
Navigate to your LangSmith project. You should find a new evaluation run associated with the dataset. Click on it to explore the trace for each example, including the agent's tool calls, intermediate steps, latency, and token usage.
Simply running the agent and tracing is useful for debugging, but quantitative evaluation requires defining specific metrics. Let's create a custom evaluator function that checks if the agent's output contains the reference answer (case-insensitive).
from langsmith.evaluation import EvaluationResult, run_evaluator
@run_evaluator
def check_contains_reference(run, example) -> EvaluationResult:
    """
    Checks if the agent's output contains the reference answer (case-insensitive).

    Args:
        run: The LangSmith run object for the agent execution.
        example: The LangSmith example object from the dataset.

    Returns:
        An EvaluationResult with a score (1 for contains, 0 otherwise)
        and a descriptive key.
    """
    agent_output = run.outputs.get("output") if run.outputs else None
    reference_output = example.outputs.get("reference") if example.outputs else None

    if agent_output is None or reference_output is None:
        # Handle cases where output or reference is missing
        score = 0
        comment = "Agent output or reference output missing."
    elif str(reference_output).lower() in str(agent_output).lower():
        score = 1  # Success: Reference found in agent output
        comment = "Reference answer found."
    else:
        score = 0  # Failure: Reference not found
        comment = f"Reference '{reference_output}' not found in output."

    return EvaluationResult(
        key="contains_reference",  # Name of the metric
        score=score,               # The numeric score (0 or 1 here)
        comment=comment,           # Optional qualitative feedback
    )
This function uses the @run_evaluator decorator, indicating it's designed for LangSmith evaluation. It accesses the agent's actual output (run.outputs) and the reference output stored in the dataset (example.outputs). It returns an EvaluationResult object containing a metric name (key) and a score.
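Before wiring the evaluator into a full run, you can sanity-check the core matching rule on plain strings; a trivial local check of the case-insensitive containment logic (the strings here are made up for illustration):

# Local sanity check of the comparison the evaluator performs.
assert "paris" in "The capital of France is Paris.".lower()
assert "nitrogen" not in "I could not find that information.".lower()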
Now, let's re-run the evaluation, this time including our custom evaluator.
# Run evaluation again, now with the custom evaluator
custom_eval_results = evaluate(
    agent_predictor,
    data=dataset_name,
    evaluators=[check_contains_reference],  # Pass the custom evaluator function
    description="Evaluation run with custom 'contains_reference' check.",
    project_name="Agent Eval Run - Custom Check",  # Log to a different run project
    # metadata={"agent_version": "1.0", "evaluator": "contains_reference_v1"},
)
print("Evaluation run with custom evaluator completed. Check LangSmith.")
Go back to LangSmith and view this new evaluation run. In the results table, you should now see a new column titled contains_reference (matching the key in our EvaluationResult). This column displays the score (0 or 1) for each example based on our custom logic. You can sort and filter by this metric to quickly identify failures. Hovering over or clicking into the feedback cell often shows the comment provided by the evaluator.
If we were to visualize the results of this simple evaluation (hypothetically, based on the contains_reference scores), it might look something like this:

[Figure: A simple bar chart showing the count of examples passing (score=1) and failing (score=0) the contains_reference evaluation metric.]
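To produce such a chart you first need the pass/fail counts. One way is to tally them from the object returned by evaluate; this sketch assumes that iterating over the results yields entries whose "evaluation_results" hold the individual scores, which may vary across langsmith versions:

from collections import Counter

# Tally pass/fail for the "contains_reference" metric from the results object.
counts = Counter()
for item in custom_eval_results:  # assumed iteration structure; check your langsmith version
    for res in item["evaluation_results"]["results"]:
        if res.key == "contains_reference":
            counts["pass" if res.score == 1 else "fail"] += 1

print(dict(counts))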
This practical exercise demonstrated the core loop of evaluating an agent with LangSmith: defining the agent, creating a dataset, running evaluation, and implementing custom checks. Systematic evaluation using tools like LangSmith is indispensable for building reliable LLM applications. It moves beyond anecdotal testing, providing quantifiable metrics and detailed tracing to understand and improve agent performance over time. From here, you can build more complex evaluations, for example by passing several evaluators to evaluate to calculate multiple metrics simultaneously, as sketched below.
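A minimal sketch of such a multi-metric run, reusing the pieces defined earlier; the second evaluator and its 300-character threshold are arbitrary illustrative choices:

@run_evaluator
def check_conciseness(run, example) -> EvaluationResult:
    """Scores 1 if the agent's answer is under 300 characters (arbitrary threshold)."""
    agent_output = run.outputs.get("output", "") if run.outputs else ""
    score = 1 if len(str(agent_output)) < 300 else 0
    return EvaluationResult(key="is_concise", score=score)

# Passing several evaluators computes multiple metrics in the same evaluation run.
multi_metric_results = evaluate(
    agent_predictor,
    data=dataset_name,
    evaluators=[check_contains_reference, check_conciseness],
    description="Evaluation run with multiple custom metrics.",
)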