Alright, let's put the concepts from this chapter into practice. We've discussed the challenges with inconsistent LLM outputs and the necessity of parsing, validation, and error handling. Now, you'll build a function that attempts to extract structured data from text using an LLM, ensuring the output is reliable enough for application use.
Imagine you have an application that needs to extract contact information (name and email) from user-submitted text snippets. The goal is to get this information reliably in a JSON format.
Let's start with some sample text:
From: Sarah Chen <s.chen@example.com>
Subject: Project Update
Hi team,
Just wanted to share the latest updates...
Best,
Sarah
Our target output format is a simple JSON object:
{
  "name": "Sarah Chen",
  "email": "s.chen@example.com"
}
First, we define a prompt designed to elicit the desired JSON output. Building on techniques from previous chapters, we'll be explicit about the format.
def create_extraction_prompt(text_input):
    """Creates a prompt to extract name and email into JSON."""
    prompt = f"""
Extract the full name and email address from the following text.
Provide the output STRICTLY in the following JSON format:
{{"name": "...", "email": "..."}}
If no name or email is found, return JSON with empty strings as values:
{{"name": "", "email": ""}}
Do not include any explanation or introductory text outside the JSON structure.

Text:
\"\"\"
{text_input}
\"\"\"

JSON Output:
"""
    return prompt
# Assume call_llm_api is a function that takes a prompt
# and returns the LLM's text response.
# def call_llm_api(prompt: str) -> str:
# # Placeholder for actual API call logic
# # In a real scenario, this would interact with OpenAI, Anthropic, etc.
# # For this example, let's simulate some possible outputs.
# pass
Now, let's try to process the response. A naive approach might just assume the LLM returns perfect JSON.
import json

# Placeholder for the actual text input
input_text = """
From: Sarah Chen <s.chen@example.com>
Subject: Project Update
...
"""

# Simulate a successful LLM response
simulated_good_response = '{"name": "Sarah Chen", "email": "s.chen@example.com"}'

# Attempt to parse
try:
    prompt = create_extraction_prompt(input_text)
    # raw_output = call_llm_api(prompt)  # Real call would be here
    raw_output = simulated_good_response  # Using simulated output
    extracted_data = json.loads(raw_output)
    print(f"Successfully extracted: {extracted_data}")
except json.JSONDecodeError:
    print("Error: Failed to decode JSON from LLM response.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
This works if the LLM behaves perfectly. But what if the response is slightly off?
# Simulate a malformed response (e.g., a trailing comma)
simulated_bad_response = '{"name": "Sarah Chen", "email": "s.chen@example.com",}'

try:
    # ... (prompt creation as before)
    raw_output = simulated_bad_response  # Using simulated bad output
    extracted_data = json.loads(raw_output)
    print(f"Successfully extracted: {extracted_data}")
except json.JSONDecodeError as e:
    print(f"Error: Failed to decode JSON. Details: {e}")  # Catches the error
except Exception as e:
    print(f"An unexpected error occurred: {e}")
The json.loads call raises a JSONDecodeError, which our basic try...except block catches. This prevents the application from crashing, but it doesn't get us the data.
Even if the JSON is syntactically valid, it might not have the structure or data types we expect. Let's introduce Pydantic for schema validation.
First, install Pydantic if you haven't already. The EmailStr type relies on the optional email-validator package, so install the email extra:
pip install "pydantic[email]"
Now, define a Pydantic model representing our desired structure:
from pydantic import BaseModel, EmailStr, ValidationError

class ContactInfo(BaseModel):
    name: str
    email: EmailStr  # Pydantic validates this is a well-formed email address
Let's integrate this into our processing logic:
from pydantic import BaseModel, EmailStr, ValidationError

# --- Pydantic Model ---
class ContactInfo(BaseModel):
    name: str
    email: EmailStr

# --- Placeholder for LLM call ---
def call_llm_api(prompt: str) -> str:
    # Simulate different outputs for demonstration
    # return '{"name": "Sarah Chen", "email": "s.chen@example.com"}'  # Good
    # return '{"name": "Sarah Chen", "email": "not-an-email"}'  # Bad email format
    return '{"contact_name": "Sarah Chen", "address": "s.chen@example.com"}'  # Wrong field names

# --- Processing Logic ---
input_text = "From: Sarah Chen <s.chen@example.com> ..."  # Sample input
prompt = create_extraction_prompt(input_text)  # Defined earlier

try:
    raw_output = call_llm_api(prompt)
    print(f"Raw LLM Output:\n{raw_output}")
    # Parse and validate in one step. Note: in Pydantic v2,
    # model_validate_json raises ValidationError for malformed JSON
    # as well as for schema mismatches, so one handler covers both.
    contact_info = ContactInfo.model_validate_json(raw_output)
    print(f"\nSuccessfully validated data: {contact_info.model_dump()}")
except ValidationError as e:
    print(f"\nError: Data validation failed. Details:\n{e}")
    print("LLM output wasn't valid JSON, or it didn't match the required schema "
          "(e.g., wrong field names, invalid email format).")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")
Run this code with the different simulated call_llm_api return values. You'll see that model_validate_json handles both JSON decoding errors and schema validation errors, providing informative details when validation fails (e.g., which field is wrong, or why the email is invalid).
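To act on those details programmatically rather than just printing them, you can inspect ValidationError.errors(), which returns one entry per failed field. A small sketch (using plain str for the email field here to keep the snippet free of the email-validator dependency; the chapter's model uses EmailStr):

```python
from pydantic import BaseModel, ValidationError

class ContactInfo(BaseModel):
    name: str
    email: str  # EmailStr in the chapter's model; plain str keeps this sketch dependency-free

bad_output = '{"contact_name": "Sarah Chen", "address": "s.chen@example.com"}'

try:
    ContactInfo.model_validate_json(bad_output)
except ValidationError as e:
    # e.errors() yields one dict per problem: which field ('loc'),
    # what kind of failure ('type'), and a human-readable message ('msg')
    for err in e.errors():
        print(f"field={err['loc']}, type={err['type']}, msg={err['msg']}")
```

Structured error entries like these are useful for targeted logging, metrics on which fields the LLM gets wrong most often, or feeding the failure back into a corrective retry prompt.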
Sometimes, an LLM might fail due to a transient issue or simply generate a slightly incorrect output on the first try. A simple retry mechanism can often resolve these intermittent problems.
Let's wrap our API call and processing logic in a retry loop:
import time
import random
from pydantic import BaseModel, EmailStr, ValidationError

# --- Pydantic Model ---
class ContactInfo(BaseModel):
    name: str
    email: EmailStr

# --- Placeholder LLM Call (modified to sometimes fail) ---
_call_count = 0

def call_llm_api_flaky(prompt: str) -> str:
    global _call_count
    _call_count += 1
    print(f"LLM API Call Attempt #{_call_count}")
    # Simulate likely failure on the first try, success afterwards
    if _call_count == 1 and random.random() < 0.7:  # 70% chance of initial failure
        # return '{"name": "Sarah Chen", "email": "s.chen@example.com",}'  # Malformed JSON
        return '{"name": "Sarah Chen"}'  # Missing field
    else:
        return '{"name": "Sarah Chen", "email": "s.chen@example.com"}'  # Good response

# --- Processing Function with Retries ---
def extract_contact_info_robust(text_input: str, max_retries: int = 2) -> ContactInfo | None:
    global _call_count
    _call_count = 0  # Reset counter for each new extraction attempt
    prompt = create_extraction_prompt(text_input)

    for attempt in range(max_retries):
        print(f"\n--- Attempt {attempt + 1} of {max_retries} ---")
        try:
            raw_output = call_llm_api_flaky(prompt)
            print(f"Raw LLM Output: {raw_output}")
            # In Pydantic v2, ValidationError covers malformed JSON
            # as well as schema mismatches.
            contact_info = ContactInfo.model_validate_json(raw_output)
            print("Validation Successful!")
            return contact_info  # Success! Exit the loop and return data
        except ValidationError as e:
            print(f"Attempt failed: {e.__class__.__name__}")
            if attempt < max_retries - 1:
                print("Retrying...")
                # Optional: add a small delay before retrying
                # time.sleep(0.5)
            else:
                print("Max retries reached. Failed to extract valid data.")
                # Log the final error and the problematic output for debugging
                # logger.error(f"Failed after {max_retries} attempts. Last output: {raw_output}", exc_info=True)
                return None  # Indicate failure
        except Exception as e:
            # Handle unexpected errors (e.g., network issues in a real API call)
            print(f"An unexpected error occurred: {e}")
            # Log this error seriously
            # logger.exception("Unexpected error during extraction")
            return None  # Indicate failure

    return None  # Should be unreachable if the loop logic is correct, but belt-and-suspenders

# --- Example Usage ---
input_text = "From: Sarah Chen <s.chen@example.com> ..."
result = extract_contact_info_robust(input_text)

if result:
    print(f"\nFinal Extracted Data: {result.model_dump_json(indent=2)}")
else:
    print("\nCould not extract contact information reliably.")
In this version:

- The call_llm_api_flaky function simulates unreliable behavior: it usually fails on the first call and succeeds afterwards.
- The extract_contact_info_robust function loops up to max_retries times.
- Parsing and validation failures are caught inside the loop and trigger a retry while attempts remain.
- If all attempts fail, the function returns None instead of raising.
- A separate Exception block handles truly unexpected issues.

What if extract_contact_info_robust returns None? Your application needs a plan: fall back to a safe default, queue the snippet for human review, or surface a clear error to the user rather than silently proceeding with missing data.
Furthermore, you could integrate Moderation APIs (as discussed earlier in the chapter) before even sending the text to the LLM for extraction or after receiving the response, adding another layer of safety regarding the content itself.
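As a sketch of that layering, with check_moderation standing in for whichever moderation endpoint you use (a hypothetical helper with a placeholder keyword policy, not a real API), the extraction pipeline might gate both input and output:

```python
def check_moderation(text: str) -> bool:
    """Hypothetical stand-in for a real moderation API call.
    Returns True when the text passes the policy check."""
    blocked_terms = ("attack", "exploit")  # placeholder policy, illustration only
    return not any(term in text.lower() for term in blocked_terms)

def extract_with_moderation(text_input: str, extract_fn):
    """Wrap any extraction function with input/output moderation gates."""
    if not check_moderation(text_input):
        return None  # refuse to send flagged input to the LLM at all
    result = extract_fn(text_input)
    if result is not None and not check_moderation(str(result)):
        return None  # refuse to surface flagged extracted content
    return result

# Using a trivial stand-in extractor for demonstration:
fake_extract = lambda text: {"name": "Sarah Chen", "email": "s.chen@example.com"}
print(extract_with_moderation("From: Sarah Chen ...", fake_extract))
print(extract_with_moderation("how to exploit ...", fake_extract))
```

Checking the input first also saves the cost of an LLM call on content you would refuse to process anyway.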
This exercise demonstrates a practical workflow for making LLM interactions more resilient. By combining careful prompting, robust parsing, strict validation, and intelligent error handling (like retries and fallbacks), you can significantly improve the reliability of applications built on Large Language Models. Remember that this is often an iterative process; monitoring failures in production will guide further refinements to your prompts and handling logic.
© 2025 ApX Machine Learning