Setting up an evaluation framework involves translating theoretical metrics and evaluation strategies into a repeatable, automated process. This systematic approach is essential for efficiently iterating on and improving your agentic systems. In this section we construct a foundation for evaluation: a software tool designed specifically to run agents against predefined test cases and measure their performance systematically.

An evaluation environment serves as a controlled setting where agent behavior can be observed, measured, and compared. It standardizes the execution process, ensuring that variations in results are attributable to the agent's logic or configuration, not to inconsistencies in the testing setup. While sophisticated evaluation systems can become complex, we'll focus on establishing a solid, extensible baseline.

## Core Components of an Evaluation Framework

A basic yet effective evaluation framework typically consists of the following components:

- **Test Case Suite:** A collection of defined scenarios designed to probe specific agent capabilities. Each test case usually includes:
  - An initial input or goal (e.g., a complex question, a task description).
  - Required context or available tools (e.g., simulated database access, specific API endpoints).
  - Ground truth or success criteria (e.g., the expected final answer, specific sub-tasks that must be completed, constraints that must be respected).
- **Agent Interface:** A standardized wrapper around your agent code. This allows the evaluation framework to interact with different agent implementations (e.g., a ReAct agent vs. a ToT agent) through a consistent API, typically involving methods like `initialize()` and `run(input)`.
- **Execution Engine:** The orchestrator that loads test cases, initializes the agent via its interface, executes the agent on each test case, and collects the outputs. This includes capturing not just the final result but potentially the intermediate reasoning steps, tool calls, and memory interactions (the agent's "trajectory").
- **Metric Calculators:** Functions or classes that process the raw execution outputs to compute the performance metrics defined earlier in the chapter (e.g., goal completion rate, task success, planning accuracy, tool call precision/recall, hallucination frequency, resource usage such as token count or latency).
- **Results Logger/Reporter:** A module responsible for storing the raw outputs and computed metrics in a structured format (e.g., JSON files, CSV, a database) and potentially generating summary reports or visualizations.

## Implementing the Components

Let's consider practical implementation details, using Python as an example.

### Test Case Definition

Test cases can be defined in various formats, such as YAML or JSON files, or directly as Python dictionaries or data classes. A structured format is preferable for clarity and ease of loading.

```python
# Example test case structure (using a Python dict)
test_case_1 = {
    "id": "TC001",
    "goal": "Find the current CEO of ExampleCorp and summarize their latest public statement regarding AI.",
    "available_tools": ["web_search", "company_database_lookup"],
    "success_criteria": {
        "ceo_identified": "Jane Doe",  # Example ground truth
        "statement_found": True,
        "summary_relevant": True,  # Requires semantic check
        "constraints": ["Must use web_search tool at least once"]
    },
    "max_steps": 10  # Optional constraint
}

test_suite = [test_case_1, ...]  # Load from file or define inline
```
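If the suite lives in a JSON file rather than inline, a small loader keeps test data separate from evaluation code. The sketch below is a minimal example under that assumption; the file name `test_suite.json` and the required-key check are illustrative, not a fixed schema.

```python
import json
from pathlib import Path

def load_test_suite(path: str = "test_suite.json") -> list:
    """Load test cases from a JSON file containing a list of test case dicts."""
    cases = json.loads(Path(path).read_text())
    # Light validation against the keys used in this section; adjust to your own schema
    required_keys = {"id", "goal", "available_tools", "success_criteria"}
    for case in cases:
        missing = required_keys - case.keys()
        if missing:
            raise ValueError(f"Test case {case.get('id', '?')} is missing keys: {missing}")
    return cases

# test_suite = load_test_suite()  # Replaces the inline definition above
```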
For expert-level evaluation, test suites should cover a wide range of scenarios: simple lookups, multi-step reasoning tasks, tasks requiring complex tool interactions, scenarios designed to trigger known failure modes (e.g., ambiguity, conflicting information), and edge cases.

### Agent Interface

A simple base class can define the expected interface:

```python
from abc import ABC, abstractmethod

class BaseAgentInterface(ABC):
    def __init__(self, config):
        self.config = config
        # Initialize LLM, tools, memory based on config

    @abstractmethod
    def run(self, goal: str, available_tools: list) -> dict:
        """
        Executes the agent's logic for the given goal.
        Returns a dictionary containing the final answer, execution
        trajectory, tool calls, errors, etc.
        """
        pass

# Example implementation for a specific agent type
class MyReActAgent(BaseAgentInterface):
    def run(self, goal: str, available_tools: list) -> dict:
        # Implementation of the ReAct loop for this agent
        trajectory = []
        final_answer = None
        tool_calls = []
        errors = []

        # ... agent execution logic ...
        print(f"Running ReAct Agent on goal: {goal}")  # Example logging

        # Simulate execution
        trajectory.append("Thought: I need to find the CEO.")
        trajectory.append("Action: company_database_lookup(company='ExampleCorp')")
        tool_calls.append({
            "tool": "company_database_lookup",
            "args": {"company": "ExampleCorp"},
            "output": "CEO: Jane Doe"
        })
        trajectory.append("Observation: Found CEO is Jane Doe.")
        trajectory.append("Thought: Now search for her latest statement.")
        trajectory.append("Action: web_search(query='Jane Doe ExampleCorp latest AI statement')")
        tool_calls.append({
            "tool": "web_search",
            "args": {"query": "Jane Doe ExampleCorp latest AI statement"},
            "output": "Snippet: ...committed to responsible AI..."
        })
        trajectory.append("Observation: Found relevant statement snippet.")
        final_answer = "CEO is Jane Doe. Latest statement highlights commitment to responsible AI."

        return {
            "final_answer": final_answer,
            "trajectory": trajectory,
            "tool_calls": tool_calls,
            "errors": errors,
            "steps_taken": len(trajectory) // 2  # Approximate step count
        }
```

This abstraction makes it straightforward to swap in different agent implementations.
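To illustrate the swap, here is a purely illustrative second agent stub that implements the same interface; `MyPlanAndExecuteAgent` is a hypothetical name, and its internals are stubbed out. Both agents can be handed to the execution engine defined below without any other changes.

```python
class MyPlanAndExecuteAgent(BaseAgentInterface):
    """Hypothetical second agent type; only the run() internals differ."""
    def run(self, goal: str, available_tools: list) -> dict:
        # A plan-and-execute agent would first draft a plan, then execute each step.
        # Stubbed out here to show that the interface, not the internals, is shared.
        return {
            "final_answer": None,
            "trajectory": ["Plan: ..."],
            "tool_calls": [],
            "errors": ["Not implemented"],
            "steps_taken": 1,
        }

# Both agents plug into the same evaluation run (run_evaluation is defined below):
# react_results = run_evaluation(MyReActAgent(config={"llm": "gpt-4"}), test_suite)
# plan_results = run_evaluation(MyPlanAndExecuteAgent(config={"llm": "gpt-4"}), test_suite)
```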
### Metric Calculation

Metrics should operate on the results dictionary returned by the agent's `run` method and the ground truth from the test case.

```python
def calculate_metrics(agent_output: dict, test_case: dict) -> dict:
    metrics = {}
    criteria = test_case["success_criteria"]
    final_answer = agent_output.get("final_answer") or ""  # Guard against a None answer

    # Example: basic goal completion check
    is_successful = True
    if "ceo_identified" in criteria:
        # Simple string check (can be more sophisticated)
        if criteria["ceo_identified"] not in final_answer:
            is_successful = False
    if criteria.get("statement_found", False):
        # Placeholder for a check on the final answer content
        if "statement" not in final_answer.lower():
            is_successful = False  # Simplified check
    metrics["success"] = is_successful

    # Example: tool usage check
    required_tool_used = False
    if "constraints" in criteria:
        for constraint in criteria["constraints"]:
            if "Must use web_search" in constraint:
                if any(call["tool"] == "web_search" for call in agent_output.get("tool_calls", [])):
                    required_tool_used = True
                else:
                    # Constraint violated: mark as failure or track separately
                    pass  # Add logic as needed
    metrics["required_tool_used"] = required_tool_used

    # Example: resource usage
    metrics["steps_taken"] = agent_output.get("steps_taken", 0)
    # Could also add token counts and latency if tracked

    return metrics
```

For expert use cases, metric calculation might involve more sophisticated techniques, such as semantic similarity checks using embeddings for answer relevance, parsing tool arguments for correctness, or analyzing the reasoning trajectory for logical fallacies.
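As one sketch of such a semantic check, an answer-relevance score can be computed by embedding the agent's final answer and a reference answer and comparing them with cosine similarity. The snippet below assumes the `sentence-transformers` package is available, that the test case supplies a `reference_answer` field (not part of the schema above), and that the model name and 0.7 threshold are illustrative choices rather than recommendations.

```python
from sentence_transformers import SentenceTransformer, util

# Loaded once and reused across test cases; the model choice is an assumption.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def answer_relevance(final_answer: str, reference_answer: str, threshold: float = 0.7) -> dict:
    """Score semantic similarity between the agent's answer and a reference answer."""
    embeddings = _embedder.encode([final_answer, reference_answer])
    score = float(util.cos_sim(embeddings[0], embeddings[1]))
    return {"relevance_score": score, "summary_relevant": score >= threshold}

# Example: fold into calculate_metrics when a reference answer is available.
# metrics.update(answer_relevance(agent_output.get("final_answer") or "",
#                                 criteria.get("reference_answer", "")))
```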
### Execution Engine and Logging

The engine iterates over the test suite, executes the agent on each case, calculates metrics, and logs the results.

```python
import json
import datetime

def run_evaluation(agent_interface: BaseAgentInterface, test_suite: list):
    results = []
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    log_filename = f"evaluation_results_{timestamp}.jsonl"

    print(f"Starting evaluation run. Logging to {log_filename}")

    for i, test_case in enumerate(test_suite):
        print(f"Running Test Case {i+1}/{len(test_suite)}: {test_case['id']}")
        try:
            agent_output = agent_interface.run(
                goal=test_case["goal"],
                available_tools=test_case["available_tools"]
            )
            computed_metrics = calculate_metrics(agent_output, test_case)
            result_entry = {
                "test_case_id": test_case["id"],
                "goal": test_case["goal"],
                "agent_config": agent_interface.config,  # Store agent config used
                "raw_output": agent_output,
                "metrics": computed_metrics,
                "success": computed_metrics.get("success", False)  # Promote primary metric
            }
        except Exception as e:
            print(f"Error running test case {test_case['id']}: {e}")
            result_entry = {
                "test_case_id": test_case["id"],
                "goal": test_case["goal"],
                "agent_config": agent_interface.config,
                "error": str(e),
                "success": False,
                "metrics": {"success": False}  # Ensure metrics dict exists
            }

        results.append(result_entry)

        # Log results incrementally (JSON Lines format)
        with open(log_filename, 'a') as f:
            f.write(json.dumps(result_entry) + '\n')

    print("Evaluation run completed.")

    # Aggregate and report summary statistics
    total_tests = len(results)
    successful_tests = sum(r.get('success', False) for r in results)
    success_rate = (successful_tests / total_tests) * 100 if total_tests > 0 else 0

    print("\nSummary:")
    print(f"Total Test Cases: {total_tests}")
    print(f"Successful: {successful_tests}")
    print(f"Success Rate: {success_rate:.2f}%")

    # Add more detailed reporting or visualization generation here
    generate_summary_visualization(results, timestamp)

    return results


def generate_summary_visualization(results: list, timestamp: str):
    # Example: visualize success/failure per test case (simplified)
    if not results:
        return

    ids = [r['test_case_id'] for r in results]
    success_values = [1 if r.get('success', False) else 0 for r in results]  # 1 = success, 0 = fail

    # Build a simple Plotly-compatible bar chart specification
    plotly_fig = {
        "data": [
            {
                "x": ids,
                "y": success_values,
                "type": "bar",
                "marker": {
                    # Green for success, red for failure
                    "color": ['#37b24d' if s == 1 else '#f03e3e' for s in success_values]
                },
                "name": "Test Outcome"
            }
        ],
        "layout": {
            "title": f"Evaluation Results ({timestamp})",
            "xaxis": {"title": "Test Case ID", "type": "category"},
            "yaxis": {"title": "Outcome (1=Success, 0=Fail)", "tickvals": [0, 1], "ticktext": ["Fail", "Success"]},
            "template": "plotly_white"  # Use a clean template
        }
    }

    # Save or display the chart (implementation depends on environment).
    # For web output, you might save this JSON or pass it to a frontend component.
    viz_filename = f"evaluation_summary_{timestamp}.json"
    with open(viz_filename, 'w') as f:
        json.dump(plotly_fig, f)
    print(f"Visualization data saved to {viz_filename}")


# Example usage
# agent_config = {"llm": "gpt-4", "react_params": {...}}
# agent = MyReActAgent(config=agent_config)
# evaluation_results = run_evaluation(agent, test_suite)
```

*Bar chart illustrating outcomes for four distinct test cases, indicating success (green) or failure (red).*

## Advanced Approaches

For expert-level applications, enhance this basic approach:

- **Asynchronous Execution:** Run multiple test cases in parallel, especially if agent execution involves significant I/O or waiting time (e.g., API calls), to speed up evaluation. Python's asyncio library is a natural fit; a minimal sketch follows this list.
- **Dependency Management:** Ensure external dependencies (databases, APIs, specific library versions) are consistent across runs. Consider containerization (e.g., Docker) for the evaluation environment.
- **Failure Analysis:** Extend logging to capture detailed error messages and agent state at the point of failure. Implement mechanisms to automatically categorize common failure modes.
- **Comparative Evaluation:** Design the system to easily run multiple agent versions or configurations against the same test suite and generate comparative reports.
- **Human-in-the-Loop:** For subjective metrics like "answer relevance" or "reasoning quality," incorporate interfaces for human evaluators to review and score agent outputs.
- **Cost Tracking:** If using paid LLM APIs, integrate cost estimation based on token usage for each test run.
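As a sketch of the asynchronous option above, the snippet below offloads each blocking `run()` call to a worker thread via asyncio and gathers the outputs concurrently. The function name `run_evaluation_async` and the concurrency limit are illustrative assumptions; metric calculation and logging are omitted for brevity, and the agent's `run()` is assumed to be thread-safe.

```python
import asyncio

async def run_evaluation_async(agent_interface: BaseAgentInterface,
                               test_suite: list,
                               max_concurrency: int = 5) -> list:
    """Run test cases concurrently by offloading blocking run() calls to threads."""
    semaphore = asyncio.Semaphore(max_concurrency)  # Cap the number of parallel agent runs

    async def run_one(test_case: dict) -> dict:
        async with semaphore:
            # asyncio.to_thread (Python 3.9+) runs the blocking call off the event loop
            agent_output = await asyncio.to_thread(
                agent_interface.run,
                test_case["goal"],
                test_case["available_tools"],
            )
        return {"test_case_id": test_case["id"], "raw_output": agent_output}

    return await asyncio.gather(*(run_one(tc) for tc in test_suite))

# results = asyncio.run(run_evaluation_async(agent, test_suite))
```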
{"color": "#495057", "width": 0.5}}, "name": "Test Outcome"}], "layout": {"title": "Example Test Case Outcomes", "xaxis": {"title": "Test Case ID", "type": "category", "tickangle": -45}, "yaxis": {"title": "Outcome", "tickvals": [0, 1], "ticktext": ["Fail", "Success"], "gridcolor": "#e9ecef"}, "bargap": 0.2, "height": 350, "margin": {"b": 100, "t": 50, "l": 50, "r": 30}, "plot_bgcolor": "#ffffff", "paper_bgcolor": "#ffffff"}}Bar chart illustrating outcomes for four distinct test cases, indicating success (green) or failure (red).Advanced ApproachesFor expert-level applications, enhance this basic approach:Asynchronous Execution: Run multiple test cases in parallel, especially if agent execution involves significant I/O or waiting time (e.g., API calls), to speed up evaluation. Use Python's asyncio library.Dependency Management: Ensure external dependencies (databases, APIs, specific library versions) are consistent across runs. Consider containerization (e.g., Docker) for the evaluation environment.Failure Analysis: Extend logging to capture detailed error messages and agent state at the point of failure. Implement mechanisms to automatically categorize common failure modes.Comparative Evaluation: Design the system to easily run multiple agent versions or configurations against the same test suite and generate comparative reports.Human-in-the-Loop: For subjective metrics like "answer relevance" or "reasoning quality," incorporate interfaces for human evaluators to review and score agent outputs.Cost Tracking: If using paid LLM APIs, integrate cost estimation based on token usage for each test run.Building even a basic evaluation process provides immense value by changing evaluation from an ad-hoc activity into a systematic, repeatable process. It forms the foundation for data-driven development, enabling you to reliably track progress, identify regressions, and pinpoint areas for optimization in your complex systems. Start simple, and iteratively enhance your process as your evaluation needs become more sophisticated.