While generating free-form text is useful, many applications require predictable, machine-readable data. For example, you might need to extract user details from a support ticket, categorize customer feedback, or get a list of action items from a meeting transcript. In these cases, receiving a simple string is not enough; you need structured data like JSON. Here are techniques for guiding an LLM to produce structured output and reliably parsing that output into usable data objects in your application.
The process involves two main steps: first, prompting the model to generate a response in the desired format, and second, parsing its output to handle the inevitable variations and imperfections.
The most direct way to get structured data is to ask for it. You can include instructions in your prompt that specify the exact JSON format you need, including the keys and value types.
For an even more reliable approach, many modern LLMs support a dedicated "JSON mode." This mode constrains the model to only output syntactically correct JSON, significantly reducing the chances of parsing errors. You can enable this feature using a GenerationConfig.
from kerb.generation import generate, GenerationConfig
# A prompt asking for structured data
prompt = """
Extract the user's name, email, and age from the following text and
return it as a JSON object:
'My name is Alice Johnson. I am 28 years old. My email is [email protected].'
"""
# Configure the generation call to enforce JSON output
json_config = GenerationConfig(
response_format={"type": "json_object"}
)
# The model will be constrained to produce only valid JSON
response = generate(prompt, config=json_config)
# The raw content will be a JSON string
print(response.content)
# {"name": "Alice Johnson", "email": "[email protected]", "age": 28}
Using a model's JSON mode is the preferred method, as it provides a strong guarantee that the output will be parseable.
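Because JSON mode guarantees syntactic validity, the returned string can be loaded directly with the standard library. A minimal sketch, using the example output above as the raw content:

```python
import json

# Raw content as returned by the model under JSON mode (example from above)
raw = '{"name": "Alice Johnson", "email": "[email protected]", "age": 28}'

# With JSON mode enabled, a direct json.loads call is safe
user = json.loads(raw)
print(user["name"])  # Alice Johnson
print(user["age"])   # 28
```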
Even when you instruct an LLM to return JSON, its output can be unpredictable. The model might:
- Wrap the JSON in markdown code fences (```json ... ```).
- Add conversational text before or after the JSON object.
- Produce slightly malformed JSON, such as unquoted keys or trailing commas.

A simple call to json.loads() is often too brittle for production use. Your application needs a more resilient parsing strategy that can handle these variations gracefully.
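To see why a plain json.loads() call falls short, consider a response whose JSON is valid but wrapped in a markdown fence; this standalone snippet fails with a JSONDecodeError:

```python
import json

# A typical LLM response: valid JSON, but wrapped in a markdown fence
llm_output = '```json\n{"name": "Alice"}\n```'

try:
    json.loads(llm_output)
except json.JSONDecodeError as e:
    # The fence characters make the whole string invalid JSON
    print(f"Direct parsing failed: {e}")
```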
The parsing module provides tools specifically designed to find and extract JSON from messy text. The extract_json function intelligently scans the text, identifies a valid JSON object or array, and parses it.
Consider an LLM response that includes both conversational text and a JSON code block:
from kerb.parsing import extract_json
llm_output = """
Here is the user data you requested:
```json
{
"name": "Alice Johnson",
"email": "[email protected]",
"age": 28,
"roles": ["developer", "team_lead"]
}
```
This data was extracted from the user database.
"""
result = extract_json(llm_output)

if result.success:
    print("Successfully extracted data:")
    print(result.data)
else:
    print(f"Failed to extract JSON: {result.error}")
The extract_json function automatically identifies the JSON within the markdown block, ignoring the surrounding text. The function returns a ParseResult object, which contains the parsed data, a success flag, and any warnings encountered during parsing.
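The core idea behind this kind of extraction can be sketched with the standard library alone. The following is a simplified illustration, not the library's actual implementation: it prefers a fenced block if one exists, then falls back to the outermost braces:

```python
import json
import re

def extract_json_sketch(text: str):
    """Simplified sketch: pull a JSON object out of a fenced block or raw text."""
    # Prefer a ```json ... ``` fenced block if one is present
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = match.group(1) if match else text
    # Fall back to the outermost object braces
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None

text = 'Here is the data:\n```json\n{"name": "Alice", "age": 28}\n```\nDone.'
print(extract_json_sketch(text))  # {'name': 'Alice', 'age': 28}
```

A production parser such as extract_json handles more cases (arrays, nested fences, multiple candidates), but the find-then-parse strategy is the same.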
Sometimes an LLM produces JSON that is almost, but not quite, valid. For example, it might miss quotes around keys or include a trailing comma. The parse_json function can be configured with a ParseMode to handle these situations.
- ParseMode.STRICT: The default mode. Expects perfectly formatted JSON.
- ParseMode.LENIENT: Attempts to fix common syntax errors like missing quotes and trailing commas.
- ParseMode.BEST_EFFORT: Scans the text for anything that looks like a JSON object and attempts to parse it, even if it's embedded within other text.

Here is how ParseMode.LENIENT can automatically correct a malformed JSON string:
from kerb.parsing import parse_json, ParseMode
# Malformed JSON with missing quotes and a trailing comma
malformed_json_string = """{
name: "Bob",
age: 35,
active: true,
}"""
result = parse_json(malformed_json_string, mode=ParseMode.LENIENT)

if result.success:
    print(f"Parsing was successful: {result.success}")
    print(f"Data was automatically fixed: {result.fixed}")
    print("Parsed Data:", result.data)
This leniency is important for building reliable applications on top of LLMs, as it reduces failures from minor formatting issues.
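The kind of repair a lenient mode performs can be approximated with a couple of regular expressions. This is a rough sketch for illustration only, not the library's implementation, and it handles just the two errors from the example above (unquoted keys and trailing commas):

```python
import json
import re

def repair_json_sketch(text: str) -> dict:
    """Fix two common LLM mistakes: unquoted keys and trailing commas."""
    # Quote bare keys: `name:` -> `"name":`
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    # Drop trailing commas before a closing brace or bracket
    fixed = re.sub(r',(\s*[}\]])', r'\1', fixed)
    return json.loads(fixed)

malformed = '{\n  name: "Bob",\n  age: 35,\n  active: true,\n}'
print(repair_json_sketch(malformed))  # {'name': 'Bob', 'age': 35, 'active': True}
```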
Extracting JSON into a Python dictionary is a good first step, but for most applications you'll want to validate the structure and types of the data. Pydantic is a popular library for data validation, and you can parse LLM outputs directly into Pydantic models.
First, define your desired data structure using a Pydantic BaseModel. This creates a clear and validated contract for your data.
from pydantic import BaseModel, Field, field_validator
from typing import List

class UserProfile(BaseModel):
    """A model to represent a user's profile."""
    name: str = Field(..., description="User's full name")
    email: str = Field(..., description="User's email address")
    age: int = Field(..., ge=18, description="User's age, must be 18 or older")
    roles: List[str] = Field(default_factory=list, description="List of user roles")

    @field_validator('email')
    @classmethod
    def validate_email_format(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email format')
        return v
With your model defined, you can use parse_to_pydantic to parse and validate the LLM's JSON output in a single step.
from kerb.parsing import parse_to_pydantic
llm_output = """
```json
{
"name": "Alice Johnson",
"email": "[email protected]",
"age": 28,
"roles": ["developer", "team_lead"]
}
```
"""
result = parse_to_pydantic(llm_output, UserProfile)

if result.success:
    user_profile: UserProfile = result.data
    print(f"Successfully parsed profile for {user_profile.name}")
    print(f"Email: {user_profile.email}")
    print(f"Roles: {', '.join(user_profile.roles)}")
else:
    print(f"Validation Error: {result.error}")
If the JSON is missing required fields, contains incorrect data types, or fails any custom validators (like our email format check), parse_to_pydantic will return a failure result with a descriptive error. This ensures that your application only works with clean, validated data.
You can improve the reliability of structured data generation by providing the model with the exact schema you expect. The pydantic_to_schema utility converts a Pydantic model into a JSON Schema definition, which you can then include in your prompt.
from kerb.prompt import render_template
from kerb.parsing import pydantic_to_schema
import json
# Generate the JSON Schema from our Pydantic model
user_schema = pydantic_to_schema(UserProfile)
# Create a prompt template that includes the schema
prompt_template = """
Extract user information from the text and format it according to the following JSON Schema.
Only return the JSON object.
Text: 'The new team lead is Charlie, age 35. His email is [email protected].'
Schema:
{{schema}}
"""
# Render the final prompt
final_prompt = render_template(
prompt_template,
{"schema": json.dumps(user_schema, indent=2)}
)
print(final_prompt)
This workflow creates a positive feedback loop: your Pydantic model defines the data contract, its schema guides the LLM to produce compliant output, and the parser validates the final result against that same model. This combination of prompting techniques and parsing gives you the control needed to build reliable, data-driven applications.
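The rendered prompt can also be reproduced with the standard library alone. This sketch hand-writes a schema fragment equivalent to what pydantic_to_schema would derive from UserProfile (the utility's exact output may differ in detail):

```python
import json

# A hand-written JSON Schema fragment for the UserProfile contract
user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "age": {"type": "integer", "minimum": 18},
        "roles": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "email", "age"],
}

prompt = (
    "Extract user information from the text and format it according to "
    "the following JSON Schema. Only return the JSON object.\n\n"
    "Text: 'The new team lead is Charlie, age 35. His email is [email protected].'\n\n"
    f"Schema:\n{json.dumps(user_schema, indent=2)}"
)
print(prompt)
```

Embedding the schema this way gives the model a machine-readable target, and the same schema later validates the response.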
© 2026 ApX Machine Learning