Inconsistent LLM outputs present challenges, necessitating effective parsing, validation, and error handling. In this exercise, you will build a function that attempts to extract structured data from text using an LLM, ensuring the output is reliable enough for application use.

Imagine you have an application that needs to extract contact information (name and email) from user-submitted text snippets. The goal is to get this information reliably in a JSON format.

## Scenario: Extracting Contact Information

Let's start with some sample text:

```
From: Sarah Chen <s.chen@example.com>
Subject: Project Update

Hi team,
Just wanted to share the latest updates...

Best,
Sarah
```

Our target output format is a simple JSON object:

```json
{
  "name": "Sarah Chen",
  "email": "s.chen@example.com"
}
```

## Initial Attempt: Prompting and Basic Parsing

First, we define a prompt designed to elicit the desired JSON output. Building on techniques from previous chapters, we'll be explicit about the format.

```python
def create_extraction_prompt(text_input):
    """Creates a prompt to extract name and email into JSON."""
    prompt = f"""
Extract the full name and email address from the following text.
Provide the output STRICTLY in the following JSON format:
{{"name": "...", "email": "..."}}
If no name or email is found, return JSON with empty strings as values:
{{"name": "", "email": ""}}
Do not include any explanation or introductory text outside the JSON structure.

Text:
\"\"\"
{text_input}
\"\"\"

JSON Output:
"""
    return prompt

# Assume call_llm_api is a function that takes a prompt
# and returns the LLM's text response.
# def call_llm_api(prompt: str) -> str:
#     # Placeholder for actual API call logic.
#     # In a real scenario, this would interact with OpenAI, Anthropic, etc.
#     # For this example, let's simulate some possible outputs.
#     pass
```

Now, let's try to process the response. A naive approach might just assume the LLM returns perfect JSON.

```python
import json

# Placeholder for the actual text input
input_text = """
From: Sarah Chen <s.chen@example.com>
Subject: Project Update
...
"""

# Simulate a successful LLM response
simulated_good_response = '{"name": "Sarah Chen", "email": "s.chen@example.com"}'

# Attempt to parse
try:
    prompt = create_extraction_prompt(input_text)
    # raw_output = call_llm_api(prompt)  # Real call would go here
    raw_output = simulated_good_response  # Using simulated output
    extracted_data = json.loads(raw_output)
    print(f"Successfully extracted: {extracted_data}")
except json.JSONDecodeError:
    print("Error: Failed to decode JSON from LLM response.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```

This works if the LLM behaves perfectly. But what if the response is slightly off?

```python
# Simulate a malformed response (e.g., trailing comma, missing quote)
simulated_bad_response = '{"name": "Sarah Chen", "email": "s.chen@example.com",}'

try:
    # ... (prompt creation)
    raw_output = simulated_bad_response  # Using simulated bad output
    extracted_data = json.loads(raw_output)
    print(f"Successfully extracted: {extracted_data}")
except json.JSONDecodeError as e:
    print(f"Error: Failed to decode JSON. Details: {e}")  # Catches the error
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```

The json.loads call raises a JSONDecodeError, which our basic try...except block catches. This prevents the application from crashing, but it doesn't get us the data.
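In practice, many parse failures come from harmless wrapping rather than truly broken JSON: the model may fence the object in markdown backticks or prepend a sentence of preamble, despite our instructions. A small pre-parse cleanup step can rescue those cases before you treat the output as a failure. The helper below is a minimal sketch of our own (the clean_json_response name and its heuristics are illustrative, not part of the exercise); it isolates the outermost {...} span and leaves genuinely invalid JSON, such as the trailing comma above, to the error handling that follows.

```python
def clean_json_response(raw_output: str) -> str:
    """Best-effort isolation of a JSON object from an LLM response.

    Illustrative helper, not part of the original exercise. Heuristic
    only: it keeps the text between the first '{' and the last '}',
    which discards markdown code fences and conversational preamble.
    It cannot repair invalid JSON such as a trailing comma.
    """
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end < start:
        return raw_output.strip()  # No JSON object found; return as-is
    return raw_output[start:end + 1]
```

You would apply it just before parsing, e.g. `json.loads(clean_json_response(raw_output))`, keeping the existing except blocks as the safety net.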
## Adding Data Validation with Pydantic

Even if the JSON is syntactically valid, it might not have the structure or data types we expect. Let's introduce Pydantic for schema validation.

First, install Pydantic if you haven't already. The EmailStr type used below relies on the optional email-validator dependency, so install it with the email extra:

```bash
pip install "pydantic[email]"
```

Now, define a Pydantic model representing our desired structure:

```python
from pydantic import BaseModel, EmailStr, ValidationError

class ContactInfo(BaseModel):
    name: str
    email: EmailStr  # Pydantic validates that this is a well-formed email address
```

Let's integrate this into our processing logic:

```python
import json

from pydantic import BaseModel, EmailStr, ValidationError

# --- Pydantic model ---
class ContactInfo(BaseModel):
    name: str
    email: EmailStr

# --- Placeholder for the LLM call ---
def call_llm_api(prompt: str) -> str:
    # Simulate different outputs for demonstration:
    # return '{"name": "Sarah Chen", "email": "s.chen@example.com"}'  # Good
    # return '{"name": "Sarah Chen", "email": "not-an-email"}'  # Bad email format
    return '{"contact_name": "Sarah Chen", "address": "s.chen@example.com"}'  # Wrong field names

# --- Processing logic ---
input_text = "From: Sarah Chen <s.chen@example.com> ..."  # Sample input
prompt = create_extraction_prompt(input_text)  # Defined earlier

try:
    raw_output = call_llm_api(prompt)
    print(f"Raw LLM Output:\n{raw_output}")

    # Attempt to parse and validate directly using Pydantic
    contact_info = ContactInfo.model_validate_json(raw_output)
    print(f"\nSuccessfully validated data: {contact_info.model_dump()}")

except json.JSONDecodeError as e:
    print(f"\nError: Failed to decode JSON. Details: {e}")
    print("LLM output likely wasn't valid JSON.")
except ValidationError as e:
    print(f"\nError: Data validation failed. Details:\n{e}")
    print("LLM output was JSON, but didn't match the required schema "
          "(e.g., wrong field names, invalid email format).")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")
```

Run this code with the different simulated call_llm_api return values. You'll see that model_validate_json handles both JSON decoding and schema validation, providing informative details when validation fails (e.g., which field is wrong, or why the email is invalid). Note that in Pydantic v2, malformed JSON is also reported as a ValidationError, so the JSONDecodeError branch above only matters for code paths that still call json.loads directly.
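Beyond the printable message, ValidationError exposes machine-readable detail through its errors() method, which returns one entry per failure. Here is a minimal sketch reusing the ContactInfo model and the bad-email output simulated above:

```python
# Inspecting validation failures programmatically. Each entry from
# e.errors() is a dict including 'loc' (the field path), 'msg', and
# 'type' describing one failure.
try:
    ContactInfo.model_validate_json('{"name": "Sarah Chen", "email": "not-an-email"}')
except ValidationError as e:
    for err in e.errors():
        print(f"Field {err['loc']}: {err['msg']} (type: {err['type']})")
```

This structured form is useful for logging exactly which field failed, or for deciding in application code whether a failure looks worth retrying.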
## Implementing Retry Mechanisms

Sometimes an LLM fails due to a transient issue, or simply generates a slightly incorrect output on the first try. A simple retry mechanism can often resolve these intermittent problems.

Let's wrap our API call and processing logic in a retry loop:

```python
import json
import random
import time

from pydantic import BaseModel, EmailStr, ValidationError

# --- Pydantic model ---
class ContactInfo(BaseModel):
    name: str
    email: EmailStr

# --- Placeholder LLM call (modified to sometimes fail) ---
_call_count = 0

def call_llm_api_flaky(prompt: str) -> str:
    global _call_count
    _call_count += 1
    print(f"LLM API Call Attempt #{_call_count}")
    # Simulate failure on the first try, success on the second
    if _call_count == 1 and random.random() < 0.7:  # 70% chance of initial failure
        # return '{"name": "Sarah Chen", "email": "s.chen@example.com",}'  # Malformed JSON
        return '{"name": "Sarah Chen"}'  # Missing field
    else:
        return '{"name": "Sarah Chen", "email": "s.chen@example.com"}'  # Good response

# --- Processing function with retries ---
def extract_contact_info_robust(text_input: str, max_retries: int = 2) -> ContactInfo | None:
    global _call_count
    _call_count = 0  # Reset counter for each new extraction attempt
    prompt = create_extraction_prompt(text_input)

    for attempt in range(max_retries):
        print(f"\n--- Attempt {attempt + 1} of {max_retries} ---")
        try:
            raw_output = call_llm_api_flaky(prompt)
            print(f"Raw LLM Output: {raw_output}")
            contact_info = ContactInfo.model_validate_json(raw_output)
            print("Validation Successful!")
            return contact_info  # Success! Exit the loop and return data
        except (json.JSONDecodeError, ValidationError) as e:
            print(f"Attempt failed: {e.__class__.__name__}")
            if attempt < max_retries - 1:
                print("Retrying...")
                # Optional: add a small delay before retrying
                # time.sleep(0.5)
            else:
                print("Max retries reached. Failed to extract valid data.")
                # Log the final error and the problematic output for debugging:
                # logger.error(f"Failed after {max_retries} attempts. "
                #              f"Last output: {raw_output}", exc_info=True)
                return None  # Indicate failure
        except Exception as e:
            # Handle unexpected errors (e.g., network issues with a real API call)
            print(f"An unexpected error occurred: {e}")
            # Log this error seriously:
            # logger.exception("Unexpected error during extraction")
            return None  # Indicate failure

    return None  # Should be unreachable if the loop logic is correct, but belt-and-suspenders

# --- Example usage ---
input_text = "From: Sarah Chen <s.chen@example.com> ..."
result = extract_contact_info_robust(input_text)

if result:
    print(f"\nFinal Extracted Data: {result.model_dump_json(indent=2)}")
else:
    print("\nCould not extract contact information reliably.")
```

In this version:

- The call_llm_api_flaky function simulates unreliable behavior.
- The extract_contact_info_robust function loops up to max_retries times.
- It catches both JSONDecodeError and ValidationError.
- If an attempt fails but retries remain, it prints a message and continues the loop.
- If all retries are exhausted, it logs the failure (you'd use a proper logging library in production) and returns None.
- It also includes a catch-all Exception block for truly unexpected issues.
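One refinement worth noting for real API calls: retrying immediately, or at a fixed interval like the commented time.sleep(0.5), can aggravate transient problems such as rate limiting. A common alternative is exponential backoff with jitter. The backoff_delay helper below is a minimal sketch of our own, not part of the exercise code:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff with jitter: roughly 0.5s, 1s, 2s, ...,
    capped at 8s, scaled by a random factor so that many clients
    retrying at once don't synchronize their requests."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)

# Inside the retry loop, replace the fixed delay with:
#     time.sleep(backoff_delay(attempt))
```

Bear in mind that backoff only helps with transient infrastructure errors; if the model deterministically returns the same malformed output for a given prompt, waiting longer changes nothing, and rephrasing the prompt (or including the validation error in a follow-up request) is the better lever.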
## Considering Fallbacks and Further Steps

What if extract_contact_info_robust returns None? Your application needs a plan:

- **Default values:** Use empty strings or predefined defaults.
- **Log and alert:** Record the failure and the input text for later review, and alert developers if failures exceed a threshold.
- **Human review queue:** Route the problematic text snippet to a human for manual extraction.
- **Simpler method:** Attempt a less complex extraction method (e.g., regular expressions) as a fallback, accepting that it may be less accurate.

Furthermore, you could integrate moderation APIs (as discussed earlier in the chapter) before sending the text to the LLM for extraction, or after receiving the response, adding another layer of safety regarding the content itself.

This exercise demonstrates a practical workflow for making LLM interactions more resilient. By combining careful prompting, parsing, validation, and error handling (such as retries and fallbacks), you can significantly improve the reliability of applications built on Large Language Models. Remember that this is often an iterative process; monitoring failures in production will guide further refinements to your prompts and handling logic.