While Large Language Models (LLMs) excel at generating human-readable text, applications often require data in more structured formats. An LLM might return a perfectly coherent paragraph describing a person, but your application might need the person's name, job title, and location as separate fields. This is where LangChain's Output Parsers come into play.
Output Parsers are classes designed to structure the text output from an LLM. They work in two main ways: first, most parsers provide format instructions that you embed in the prompt, telling the LLM how to shape its response; second, they parse the model's raw text response into a structured Python object such as a dictionary, list, or Pydantic model.
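This two-part contract can be sketched with plain Python, no LangChain required. The class below is a toy stand-in to illustrate the idea, not the real LangChain API:

```python
import json

class ToyJsonParser:
    """Illustrative stand-in for an output parser (not LangChain's actual class)."""

    def get_format_instructions(self) -> str:
        # Text you splice into the prompt to steer the LLM's formatting.
        return 'Return the result as a JSON object with keys "name" and "skill".'

    def parse(self, llm_output: str) -> dict:
        # Turn the raw model text into a Python structure.
        return json.loads(llm_output)

parser = ToyJsonParser()
prompt = "Extract the name and skill.\n" + parser.get_format_instructions()
result = parser.parse('{"name": "Alice", "skill": "Python"}')
print(result["skill"])  # Python
```

Real LangChain parsers follow the same shape: instructions go into the prompt, and parse() runs on whatever text the model returns.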
Let's look at some commonly used Output Parsers in LangChain.
As the name suggests, SimpleJsonOutputParser is designed to parse simple JSON objects from the LLM's output. It's useful when you need a straightforward dictionary structure.
# Assuming 'llm' is an initialized LangChain LLM instance
# and 'ChatPromptTemplate' and 'StrOutputParser' are imported
from langchain_core.output_parsers import SimpleJsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI # Example LLM provider
# Example: Replace with your actual LLM initialization
# Ensure OPENAI_API_KEY is set in your environment
llm = ChatOpenAI(model="gpt-3.5-turbo")
# Define the prompt, asking for JSON output
prompt_template = """
Extract the name and primary skill from the following job description:
{description}
Return the result as a JSON object with keys "name" and "skill".
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
# Create the parser instance
json_parser = SimpleJsonOutputParser()
# Create the chain
chain = prompt | llm | json_parser
# Run the chain
job_description = "We are hiring a Senior Python Developer proficient in web frameworks and cloud services."
result = chain.invoke({"description": job_description})
print(result)
# Expected output (may vary slightly based on LLM):
# {'name': 'Senior Python Developer', 'skill': 'Python'}
This parser expects the LLM output to be a string containing a valid JSON object. If the LLM fails to produce valid JSON, it will likely raise a parsing error.
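Conceptually, the parse step is close to a json.loads call on the model's text. A rough stdlib illustration of both the success and failure cases (not LangChain's actual implementation):

```python
import json

good = '{"name": "Senior Python Developer", "skill": "Python"}'
bad = "Sure! The name is Senior Python Developer."  # prose, not JSON

parsed = json.loads(good)
print(parsed["skill"])  # Python

try:
    json.loads(bad)
except json.JSONDecodeError as err:
    # This is the kind of failure a LangChain parsing error wraps.
    print(f"Parsing failed: {err}")
```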
For more complex data structures and added validation, PydanticOutputParser is an excellent choice. It integrates with Pydantic, a popular Python library for data validation and settings management. You define your desired output structure as a Pydantic model, and the parser handles both generating formatting instructions and parsing the LLM output into an instance of that model.
First, define your data structure using Pydantic:
# Requires 'pip install pydantic'
from pydantic import BaseModel, Field
from typing import List
class PersonInfo(BaseModel):
    name: str = Field(description="The person's full name")
    age: int = Field(description="The person's age")
    hobbies: List[str] = Field(description="A list of the person's hobbies")
Now, use PydanticOutputParser with this model:
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
# Assuming 'llm' is an initialized LangChain LLM instance
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-3.5-turbo")
# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=PersonInfo)
# Get format instructions to guide the LLM
format_instructions = parser.get_format_instructions()
# Define the prompt template including the format instructions
prompt_template_str = """
Extract information about a person from the following text:
{text_input}
{format_instructions}
"""
prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["text_input"],
    partial_variables={"format_instructions": format_instructions},
)
# Create the chain
chain = prompt | llm | parser
# Input text
text = "Alice is 30 years old and enjoys painting, hiking, and coding."
# Run the chain
person_object = chain.invoke({"text_input": text})
print(person_object)
# Expected output:
# name='Alice' age=30 hobbies=['painting', 'hiking', 'coding']
print(f"Name: {person_object.name}")
print(f"Age: {person_object.age}")
print(f"Hobbies: {person_object.hobbies}")
# Name: Alice
# Age: 30
# Hobbies: ['painting', 'hiking', 'coding']
The get_format_instructions() method generates text describing the required JSON schema (based on the Pydantic model), which helps the LLM format its output correctly. Using Pydantic models provides automatic validation: if the LLM output doesn't conform to the PersonInfo schema (e.g., it provides text instead of a number for age), Pydantic will raise a validation error.
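To see the kind of checking this buys you, here is a hand-rolled, stdlib-only sketch that mimics (loosely) what Pydantic validates for the PersonInfo fields; the parse_person name is hypothetical:

```python
import json

def parse_person(llm_output: str) -> dict:
    """Hypothetical validator mimicking, loosely, Pydantic's field checks."""
    data = json.loads(llm_output)
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    if not isinstance(data.get("age"), int):
        raise ValueError("'age' must be an integer")
    if not isinstance(data.get("hobbies"), list):
        raise ValueError("'hobbies' must be a list")
    return data

ok = parse_person('{"name": "Alice", "age": 30, "hobbies": ["painting"]}')
print(ok["age"])  # 30

try:
    # Text instead of a number for age, as described above.
    parse_person('{"name": "Alice", "age": "thirty", "hobbies": []}')
except ValueError as err:
    print(err)  # 'age' must be an integer
```

Pydantic does all of this (and much more, such as coercion and nested models) automatically from the type annotations.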
When you simply need a list of items, CommaSeparatedListOutputParser is the straightforward choice. It instructs the LLM to return a comma-separated list and then parses that string into a Python list.
from langchain_core.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import ChatPromptTemplate
# Assuming 'llm' is an initialized LangChain LLM instance
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-3.5-turbo")
# Create the parser
list_parser = CommaSeparatedListOutputParser()
# Get format instructions
format_instructions = list_parser.get_format_instructions()
# Define the prompt
prompt_template = """
List 5 popular Python web frameworks.
{format_instructions}
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
# Create the chain
chain = prompt | llm | list_parser
# Run the chain
result = chain.invoke({}) # No specific input needed for this prompt
print(result)
# Expected output (list order and specific frameworks may vary):
# ['Django', 'Flask', 'FastAPI', 'Pyramid', 'Bottle']
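The parsing step here is essentially a split on commas plus whitespace trimming, which you can see in a two-line stdlib sketch (not LangChain's exact implementation):

```python
# Raw text as an LLM might return it, following the format instructions.
llm_output = "Django, Flask, FastAPI, Pyramid, Bottle"

# Split on commas and strip the surrounding whitespace from each item.
frameworks = [item.strip() for item in llm_output.split(",")]
print(frameworks)  # ['Django', 'Flask', 'FastAPI', 'Pyramid', 'Bottle']
```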
StructuredOutputParser offers a more general way to define multiple output fields without needing Pydantic: you define the desired fields and their descriptions. Like the Pydantic parser, it generates formatting instructions.
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain_core.prompts import PromptTemplate
# Assuming 'llm' is an initialized LangChain LLM instance
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-3.5-turbo")
# Define the desired output schema
response_schemas = [
    ResponseSchema(name="answer", description="The answer to the user's question."),
    ResponseSchema(name="source", description="The source used to find the answer, should be a website URL if possible.")
]
# Create the parser
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
# Get format instructions
format_instructions = output_parser.get_format_instructions()
# Define the prompt template
prompt_template_str = """
Answer the user's question as accurately as possible.
{format_instructions}
Question: {question}
"""
prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions},
)
# Create the chain
chain = prompt | llm | output_parser
# Run the chain
question = "What is the capital of France?"
result = chain.invoke({"question": question})
print(result)
# Expected output (source might vary or be estimated by the LLM):
# {'answer': 'The capital of France is Paris.', 'source': 'General knowledge / Wikipedia'}
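StructuredOutputParser's format instructions ask the model to wrap its JSON in a markdown code block, so the parse step has to locate the JSON inside surrounding text before loading it. An illustrative stdlib version of that extraction (a sketch, not LangChain's actual code; the fence is built programmatically only to avoid clashing with this example's own code block):

```python
import json
import re

fence = "`" * 3  # a markdown code fence, i.e. three backticks
llm_output = (
    "Here is the answer:\n"
    f"{fence}json\n"
    '{"answer": "The capital of France is Paris.", "source": "Wikipedia"}\n'
    f"{fence}"
)

# Grab the outermost braces, ignoring any prose or fences around them.
match = re.search(r"\{.*\}", llm_output, re.DOTALL)
if match is None:
    raise ValueError("no JSON object found in model output")
result = json.loads(match.group(0))
print(result["answer"])  # The capital of France is Paris.
```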
As seen in the examples, Output Parsers are typically the last step in a LangChain chain. The basic structure is:
Prompt -> LLM -> Output Parser
The prompt formats the input and includes any necessary formatting instructions from the parser. The LLM generates the text response. The output parser then takes this text and transforms it into the desired Python structure.
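The same three-stage flow can be mimicked with plain functions standing in for each component; this is a toy illustration of the data flow, not LangChain's Runnable machinery, and fake_llm simply returns canned text where a real model would generate it:

```python
import json

def prompt(inputs: dict) -> str:
    # Stage 1: format the user input into a full prompt string.
    return f"Extract the skill from: {inputs['description']}. Reply as JSON."

def fake_llm(prompt_text: str) -> str:
    # Stage 2: stand-in for the model; a real LLM would generate this text.
    return '{"skill": "Python"}'

def parser(llm_output: str) -> dict:
    # Stage 3: convert the raw text into a Python structure.
    return json.loads(llm_output)

result = parser(fake_llm(prompt({"description": "Senior Python Developer"})))
print(result)  # {'skill': 'Python'}
```

LangChain's `|` operator composes the real components in exactly this left-to-right order.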
LLMs don't always follow instructions perfectly. Sometimes their output won't match the format the parser expects (e.g., missing quotes in JSON, incorrect data types). When this happens, the parse() method of the output parser will typically raise an exception.
In production applications, you'll need robust error handling. This might involve wrapping parser calls in try/except blocks, retrying the LLM call, or using wrappers like OutputFixingParser that attempt to automatically fix malformed output by feeding the error back to the LLM.

Choosing the right Output Parser depends on the complexity of the data you need to extract and whether you require validation. They are essential tools for making LLM outputs usable in downstream application logic, turning freeform text into actionable, structured data.
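A minimal sketch of that feed-the-error-back pattern, with a stand-in function in place of a real model call (call_llm and parse_with_retry are hypothetical names; the stand-in returns deliberately broken JSON on the first attempt):

```python
import json

def call_llm(prompt_text: str) -> str:
    # Stand-in for a real model call; returns bad JSON unless asked to fix it.
    if "Fix" in prompt_text:
        return '{"name": "Alice"}'
    return "{'name': 'Alice'}"  # single quotes: invalid JSON

def parse_with_retry(prompt_text: str, max_attempts: int = 2) -> dict:
    for attempt in range(max_attempts):
        raw = call_llm(prompt_text)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Feed the failure back to the model, in the spirit of OutputFixingParser.
            prompt_text = f"Fix this invalid JSON ({err}): {raw}"
    raise ValueError("could not obtain valid JSON")

result = parse_with_retry("Return Alice's info as JSON.")
print(result)  # {'name': 'Alice'}
```

LangChain's OutputFixingParser wraps another parser and an LLM to do this retry-with-error loop for you.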
© 2025 ApX Machine Learning