This practical exercise focuses on building a cohesive LLM application by combining models and prompts to structure data. This application demonstrates a common pattern in LLM development: taking unstructured text as input and producing structured, machine-readable data as output. The goal is to build an application that can read a short biography and extract specific details like a person's name, title, and company.
This process uses the model's native structured output capabilities, combining generation and parsing into a single step.
The data extraction workflow. Unstructured text is formatted by a prompt, and the model is configured to return a structured Python object directly.
Before we can extract information, we must first define the structure of the data we want. A schema acts as a contract for our output, ensuring consistency and predictability. The Pydantic library is the standard for data validation in Python and integrates well with LangChain.
Let's define a PersonProfile schema that includes a name, job title, and company. We can also add descriptions to guide the LLM in correctly identifying each piece of information.
from pydantic import BaseModel, Field
from typing import Optional


class PersonProfile(BaseModel):
    """A structured representation of a person's professional profile."""

    name: str = Field(description="The full name of the person.")
    title: str = Field(description="The professional title or role of the person.")
    company: str = Field(description="The name of the company the person works for.")
    years_of_experience: Optional[int] = Field(
        None, description="The total number of years of professional experience."
    )
By creating this class, we have established a clear target format for our LLM. The descriptions within Field help the model understand the semantics of each attribute.
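Because the schema is an ordinary Pydantic model, you can exercise it before wiring up any LLM. The sketch below (plain Pydantic v2, no LangChain involved, with illustrative sample data) validates a well-formed record and shows that a record missing required fields is rejected, which is exactly the guarantee the extractor will rely on later:

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class PersonProfile(BaseModel):
    """A structured representation of a person's professional profile."""

    name: str = Field(description="The full name of the person.")
    title: str = Field(description="The professional title or role of the person.")
    company: str = Field(description="The name of the company the person works for.")
    years_of_experience: Optional[int] = Field(
        None, description="The total number of years of professional experience."
    )


# A well-formed record passes validation.
profile = PersonProfile.model_validate(
    {"name": "Ada Lovelace", "title": "Analyst", "company": "Babbage & Co."}
)
print(profile.years_of_experience)  # None -- the optional field falls back to its default

# A record missing required fields raises ValidationError.
try:
    PersonProfile.model_validate({"name": "Ada Lovelace"})
except ValidationError as exc:
    print(f"Rejected: {len(exc.errors())} missing fields")
```

Validating by hand like this is a quick way to confirm which fields are mandatory and which can be safely omitted by the model.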
With our data schema defined, we can now set up the extractor. Modern LLMs like OpenAI's gpt-4o-mini support structured output natively, which is more reliable than prompt-based parsing.
We will use the .with_structured_output() method to bind our Pydantic schema to the model. This tells the model to conform its output to the PersonProfile class. We also define a simple prompt template to pass the input text to the model.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# Define the prompt template
prompt = ChatPromptTemplate.from_template("Extract information from the following text.\nText: {query}")
# Initialize the model
llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")
# Configure the model for structured output
structured_llm = llm.with_structured_output(PersonProfile)
This configuration simplifies the pipeline by handling the schema injection and parsing logic internally.
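Under the hood, the Pydantic class is converted into a JSON Schema that the provider receives (typically via its tool-calling or JSON-schema mode). You can inspect that schema yourself with Pydantic's model_json_schema(); the sketch below (no model call involved, PersonProfile redefined for self-containment) shows how the Field descriptions and the required/optional split travel with each attribute:

```python
from typing import Optional

from pydantic import BaseModel, Field


class PersonProfile(BaseModel):
    """A structured representation of a person's professional profile."""

    name: str = Field(description="The full name of the person.")
    title: str = Field(description="The professional title or role of the person.")
    company: str = Field(description="The name of the company the person works for.")
    years_of_experience: Optional[int] = Field(
        None, description="The total number of years of professional experience."
    )


schema = PersonProfile.model_json_schema()

# Field descriptions become part of the schema the model is guided by.
print(schema["properties"]["name"]["description"])

# Optional fields are absent from the required list.
print(schema["required"])
```

Inspecting the generated schema is a useful debugging step: if an attribute is being extracted poorly, its description here is the first thing to refine.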
Now we connect our components into a processing pipeline using the LangChain Expression Language (LCEL). The pipe symbol (|) connects the elements, creating a sequence where the output of one step becomes the input to the next.
# Create the chain
extractor_chain = prompt | structured_llm
Running this chain is straightforward. We only need to provide the query variable.
Let's test our extractor with a sample piece of text. We will invoke the chain and inspect the output.
# Input text
text_input = """
Alex Thompson is the Senior Data Scientist at InnovateCorp,
where he has been leading the AI research division for the past 5 years.
"""
# Invoke the chain
result = extractor_chain.invoke({"query": text_input})
# Print the structured output
print(result)
print(f"\nType of result: {type(result)}")
The expected output will be a PersonProfile object, not a simple string or dictionary.
name='Alex Thompson' title='Senior Data Scientist' company='InnovateCorp' years_of_experience=5
Type of result: <class '__main__.PersonProfile'>
Success. The chain correctly processed the unstructured sentence and returned a Pydantic object. We can now access the data reliably using standard object attributes, such as result.name or result.company. This example highlights how using a model's structured output capability creates a dependable bridge from unstructured language to structured data.
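Because the result is a Pydantic object, it plugs directly into the rest of a Python application. For instance, model_dump() and model_dump_json() (Pydantic v2) convert it for storage or an API response. The sketch below constructs a PersonProfile by hand to stay runnable offline; in practice the object would come from extractor_chain.invoke(...):

```python
from typing import Optional

from pydantic import BaseModel, Field


class PersonProfile(BaseModel):
    """A structured representation of a person's professional profile."""

    name: str = Field(description="The full name of the person.")
    title: str = Field(description="The professional title or role of the person.")
    company: str = Field(description="The name of the company the person works for.")
    years_of_experience: Optional[int] = Field(
        None, description="The total number of years of professional experience."
    )


# Stand-in for the chain's output shown above.
result = PersonProfile(
    name="Alex Thompson",
    title="Senior Data Scientist",
    company="InnovateCorp",
    years_of_experience=5,
)

record = result.model_dump()        # plain dict, ready for a database row
payload = result.model_dump_json()  # JSON string, ready for an API response

print(record["company"])  # InnovateCorp
```

This round-trip is the payoff of schema-first extraction: downstream code works with typed attributes and serializers instead of parsing free text.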