Even with well-designed prompts for tool selection and operation, AI agents will inevitably encounter errors during tool execution. APIs can become temporarily unavailable, inputs might be malformed in unexpected ways, tools might return errors, or network issues can interrupt communication. Building robust agentic workflows requires anticipating these issues and providing the agent with instructions on how to handle them. Prompt engineering is a key method for guiding an agent's response to such failures, enabling it to recover, retry, or at least fail gracefully.
When an agent attempts to use an external tool, several types of errors can occur. Understanding these categories helps in crafting prompts that address them effectively:
Connectivity and Availability Errors: The tool's endpoint cannot be reached at all, due to network timeouts, DNS failures, or a service outage.
API Usage Errors (Client-Side): The request itself is faulty, such as malformed parameters, missing authentication, or exceeded rate limits (typically 4xx responses).
Tool-Specific Operational Errors (Server-Side or Tool Logic): The tool executes but reports a failure in its output (e.g., {"status": "error", "message": "Invalid search query"}).
Resource Errors: The requested resource does not exist or cannot be accessed, such as a missing record or file.
The first step in handling an error is for the agent to recognize that one has occurred. Your prompts should guide the agent to inspect tool outputs for signs of trouble.
Checking Status Codes: For tools interacting via HTTP (like most APIs), instruct the agent to check the HTTP status code.
After making a call to the `weather_api`, examine the HTTP status code. A status code of 200 indicates success. Any other status code, especially in the 4xx or 5xx range, should be treated as an error.
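As a rough sketch, the classification this instruction asks the agent to perform could be expressed in code like this (the helper name is hypothetical, not part of any real API):

```python
def classify_http_status(status_code: int) -> str:
    """Classify an HTTP status code the way the prompt instructs the agent to."""
    if status_code == 200:
        return "success"
    if 400 <= status_code < 500:
        return "client_error"   # problem with the request itself
    if 500 <= status_code < 600:
        return "server_error"   # problem on the service's side
    return "unexpected"
```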
Parsing Error Fields: Many APIs return structured error responses (e.g., JSON with an error key).
When using the `user_database_tool`, check the JSON response. If the response contains a top-level key named "error_message" or "detail", an error has occurred. Extract the value of this key for further processing.
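A minimal sketch of this check, assuming the tool's response arrives as a JSON string (the `extract_error` helper is illustrative):

```python
import json

def extract_error(response_text: str):
    """Return the error message if the JSON response signals one, else None."""
    data = json.loads(response_text)
    for key in ("error_message", "detail"):
        if key in data:
            return str(data[key])
    return None
```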
Keyword Spotting: For less structured outputs or tool logs, you might instruct the agent to look for specific error keywords.
If the output from the `image_processing_script` contains terms like "failed", "exception", "Traceback", or "Error:", assume an error has occurred.
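Keyword spotting is a heuristic, so a code version is correspondingly simple; this sketch uses the keywords from the prompt above (case-insensitively):

```python
ERROR_KEYWORDS = ("failed", "exception", "Traceback", "Error:")

def looks_like_error(output: str) -> bool:
    """Heuristic check: does the tool output contain any known error keyword?"""
    lowered = output.lower()
    return any(keyword.lower() in lowered for keyword in ERROR_KEYWORDS)
```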
Once an error is detected, the agent needs instructions on how to proceed. Your prompts can specify a range of actions:
Simple Retries: For transient issues like network glitches or temporary service unavailability, a simple retry (perhaps with a delay) is often effective.
If the `stock_quote_api` returns a 503 Service Unavailable error or a timeout occurs, wait for 3 seconds and then attempt the API call one more time. If it fails a second time, report the failure.
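The retry policy that prompt describes maps onto a small wrapper like the following sketch, where `call` stands in for any zero-argument tool invocation that raises on failure:

```python
import time

def call_with_retry(call, retries=1, delay_seconds=3.0):
    """Call a tool; on failure, wait and retry up to `retries` more times.

    If every attempt fails, the last exception propagates so the agent
    can report the failure.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay_seconds)
```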
Retries with Modified Inputs: If an error suggests a problem with the input (e.g., a "bad request" error), the agent might be prompted to try again with a corrected or simplified input, if feasible.
If the `product_search_tool` returns an "InvalidParameterValue" error for a complex query, try to simplify the query by removing optional filters and attempt the search again. For instance, if searching for "blue striped cotton shirt size M", and it fails, try "blue shirt".
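In code form, this "retry with a simpler input" pattern might look like the sketch below, where a `ValueError` stands in for the tool's "InvalidParameterValue" error:

```python
def search_with_fallback(search, query: str, simplified_query: str):
    """Try the full query; on an invalid-parameter error, retry simplified."""
    try:
        return search(query)
    except ValueError:  # stands in for an "InvalidParameterValue" API error
        return search(simplified_query)
```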
Using Fallback Tools or Strategies: If a primary tool fails persistently, you can direct the agent to try an alternative.
If `primary_translation_service` fails to provide a translation after two attempts, use `secondary_translation_service` with the same input text. If both fail, inform the user that translation is currently unavailable.
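That instruction implies a primary-then-fallback chain; a sketch, with the two services passed in as plain callables:

```python
def translate(text, primary, secondary, attempts=2):
    """Try the primary service up to `attempts` times, then fall back.

    Returns None if both services fail, so the caller can inform the
    user that translation is currently unavailable.
    """
    for _ in range(attempts):
        try:
            return primary(text)
        except Exception:
            continue
    try:
        return secondary(text)
    except Exception:
        return None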
Requesting Clarification or Correction: If the error stems from ambiguous or incorrect user input that the agent cannot resolve itself, prompt the agent to ask the user for help.
If the `calendar_tool` reports an "ambiguous date format" error, ask the user to provide the date in 'YYYY-MM-DD' format.
Graceful Degradation of Service: Sometimes, a non-critical tool failure means a part of the task cannot be completed, but the overall goal might still be partially achievable.
When generating a research report, if the `fetch_latest_news_tool` fails, proceed with generating the report using only the information from the `company_database_tool` and include a note in the report: "Latest news could not be retrieved due to a temporary issue with the news service."
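A sketch of that degradation logic, with the two tools passed in as callables and the note text taken from the prompt above:

```python
def build_report(fetch_news, fetch_company_data):
    """Assemble a report, degrading gracefully if the news tool fails."""
    sections = [fetch_company_data()]
    try:
        sections.append(fetch_news())
    except Exception:
        sections.append(
            "Latest news could not be retrieved due to a temporary "
            "issue with the news service."
        )
    return "\n\n".join(sections)
```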
Structured Error Reporting: For debugging and monitoring, it's useful to have the agent report errors in a consistent format.
If any tool execution results in an unrecoverable error, output the following information:
Tool Name: [Name of the tool that failed]
Input Provided: [The exact input given to the tool]
Error Type: [e.g., HTTP_500, API_Auth_Error, Timeout, Malformed_Output]
Error Details: [Specific error message or code received from the tool]
Recovery Attempts: [Number of retries, alternative tools tried]
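To keep such reports machine-parseable, the same template can be rendered by a small helper; this is one possible rendering of the format above:

```python
def format_error_report(tool_name, tool_input, error_type, details, recovery):
    """Render an unrecoverable tool error in the agreed reporting format."""
    return (
        f"Tool Name: {tool_name}\n"
        f"Input Provided: {tool_input}\n"
        f"Error Type: {error_type}\n"
        f"Error Details: {details}\n"
        f"Recovery Attempts: {recovery}"
    )
```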
Consider an agent tasked with fetching user data from an API.
Initial Prompt Snippet (without robust error handling):
Goal: Fetch user details for user ID 123.
Tool available: `get_user_data(user_id)` - calls `https://api.example.com/users/{user_id}`.
If api.example.com is down, the agent might just stop or return a raw network error.
Improved Prompt Snippet (with error handling):
Goal: Fetch user details for user ID 123.
Tool available: `get_user_data(user_id)` - calls `https://api.example.com/users/{user_id}`.
Instructions for `get_user_data`:
1. Call the API.
2. If the HTTP status is 200, return the JSON response.
3. If the HTTP status is 401 or 403 (Authentication/Authorization Error): Report "API access denied. Check credentials." Do not retry.
4. If the HTTP status is 404 (Not Found): Report "User ID not found." Do not retry.
5. If the HTTP status is 500 or 503 (Server Error/Unavailable), or if a network timeout occurs:
a. Wait for 5 seconds.
b. Retry the call once.
c. If it fails again, report "User data API is temporarily unavailable. Tried 2 times."
6. For any other 4xx error: Report "API request error: [status code] - [response text if available]."
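The six numbered rules above form a small decision procedure. As a sketch only, here is what that procedure could look like if implemented directly in code; `http_get` is a hypothetical stand-in for a real HTTP client that returns a `(status_code, body)` pair or raises `TimeoutError`:

```python
import time

def get_user_data(user_id, http_get, delay_seconds=5.0):
    """Apply the prompt's error-handling rules to a user-data API call."""
    for attempt in range(2):  # one initial call plus one retry
        try:
            status, body = http_get(f"https://api.example.com/users/{user_id}")
        except TimeoutError:
            status, body = None, None  # treat a timeout as retryable
        if status == 200:
            return body
        if status in (401, 403):
            return "API access denied. Check credentials."
        if status == 404:
            return "User ID not found."
        if status in (500, 503) or status is None:
            if attempt == 0:
                time.sleep(delay_seconds)
                continue
            return "User data API is temporarily unavailable. Tried 2 times."
        return f"API request error: {status} - {body}"
```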
The decision process for handling a tool error can often be represented as a flow. Even simple instructions imply a decision tree for the agent.
An agent's decision path when a tool call encounters an issue, guided by prompt-based error handling rules.
By thoughtfully designing prompts that guide agents through error scenarios, you significantly increase the reliability and robustness of your agentic systems. Agents that can intelligently respond to tool failures are more autonomous and provide a better user experience.
© 2025 ApX Machine Learning