The foundation of any RAG system is the data it draws upon. Before an LLM can reason over your information, that information must be loaded and structured in a way the system can understand. This initial step, data ingestion, is handled by LangChain's DocumentLoader components. They provide a standardized way to import data from a multitude of sources, transforming raw content into a uniform format that the rest of the framework can process.
At the heart of LangChain's data handling is the Document object. Think of it as a standardized container for a piece of text and its associated information. Each Document object has two main attributes:
page_content: A string that holds the actual text content.
metadata: A Python dictionary that contains supplementary information about the content, such as its source, page number, or creation date. This metadata is extremely useful for filtering, tracking provenance, and providing citations in your final application.
Here is how you might create a Document object manually:
from langchain_core.documents import Document
doc = Document(
    page_content="LangChain provides a standard interface for document loaders.",
    metadata={"source": "internal_docs/intro.md", "chapter": 1}
)
print(doc.page_content)
# LangChain provides a standard interface for document loaders.
print(doc.metadata)
# {'source': 'internal_docs/intro.md', 'chapter': 1}
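Because metadata is an ordinary dictionary, you can filter a collection of Document objects with plain Python. A minimal sketch, using a small hypothetical list built the same way as the example above:
documents = [
    doc,
    Document(
        page_content="Text splitters break documents into smaller chunks.",
        metadata={"source": "internal_docs/splitting.md", "chapter": 2}
    )
]
# Keep only the documents from chapter 1
chapter_one = [d for d in documents if d.metadata.get("chapter") == 1]
print(len(chapter_one))
# 1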
While you can create Document objects by hand, the real utility comes from using DocumentLoaders to generate them automatically from your data sources. A DocumentLoader is an object designed to fetch data from a source and convert it into a list of Document objects.
The document loader acts as a unified interface, converting various data sources into a standardized list of Document objects for downstream processing.
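Because every loader returns the same type, code that consumes documents does not need to know where they came from. As a minimal sketch (the ingest helper below is a hypothetical name, not a LangChain class), any of the loaders covered in this section could be passed to it:
from langchain_core.documents import Document

def ingest(loader) -> list[Document]:
    # Every DocumentLoader exposes .load(), which returns a list of Document
    # objects, so TextLoader, PyPDFLoader, WebBaseLoader, and others are
    # interchangeable from the caller's point of view.
    return loader.load()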
LangChain offers a wide array of document loaders, each tailored for a specific data source. Let's examine a few of the most frequently used ones.
The simplest case is loading data from a plain .txt file. For this, we use the TextLoader.
First, let's create a sample file.
# In your terminal
echo "This is the first sentence. This is the second." > sample.txt
Now, we can use TextLoader in Python to load it. The .load() method performs the operation and returns a list of Document objects. For TextLoader, this is typically a list with a single Document containing the entire file's content.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("sample.txt")
docs = loader.load()
print(len(docs))
# 1
print(docs[0].page_content)
# This is the first sentence. This is the second.
print(docs[0].metadata)
# {'source': 'sample.txt'}
A common requirement is to ingest data from PDF files. The PyPDFLoader integration handles this by loading a PDF and splitting its content by page, creating one Document object for each page. This is a sensible default, as it automatically preserves page boundaries.
Assuming you have a PDF file named product_manual.pdf:
# You may need to install the pypdf library: pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("product_manual.pdf")
pages = loader.load()
# Assuming the PDF has multiple pages
print(f"Loaded {len(pages)} pages from the PDF.")
# Inspect the first page's document
first_page_doc = pages[0]
print(first_page_doc.page_content[:200]) # Print first 200 characters
# ... content of the first page ...
print(first_page_doc.metadata)
# {'source': 'product_manual.pdf', 'page': 0}
Notice how the metadata now includes a page key, which can be invaluable for referencing the original source.
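For example, you could assemble a simple citation string from that metadata when surfacing an answer. A minimal sketch, reusing the pages list loaded above (pypdf numbers pages from 0, so we add 1 for display):
# Build a human-readable citation from the loader-provided metadata
citation = f"{first_page_doc.metadata['source']}, page {first_page_doc.metadata['page'] + 1}"
print(citation)
# product_manual.pdf, page 1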
To load content directly from a URL, you can use the WebBaseLoader. It fetches the HTML from the given URL, parses it, and extracts the text content into a Document. This is an efficient way to pull in online articles, blog posts, or documentation.
# You may need to install BeautifulSoup: pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.oreilly.com/about/")
docs = loader.load()
print(docs[0].page_content[:500])
# ... a clean text version of the webpage's content ...
print(docs[0].metadata)
# {'source': 'https://www.oreilly.com/about/', 'title': 'About O’Reilly - O’Reilly'}
The metadata for WebBaseLoader typically includes the source URL and the page title, which allows you to reference the original content.
In many applications, you'll need to load all documents from a directory. The DirectoryLoader provides a convenient way to do this. You specify a path and a glob pattern to select which files to load, and you can also specify which loader class should be used for the matched files.
For example, to load all Markdown files (.md) from a documentation/ folder using the TextLoader:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Assume documentation/ has multiple .md files
loader = DirectoryLoader(
    path="documentation/",
    glob="**/*.md",  # Load all .md files in all subdirectories
    loader_cls=TextLoader,  # Use TextLoader for these files
    show_progress=True  # Display a progress bar
)
docs = loader.load()
print(f"Loaded {len(docs)} documents from the directory.")
This pattern of combining a DirectoryLoader with a specific file loader is a powerful and efficient method for bulk data ingestion.
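If a folder mixes file types, one approach is to run a separate DirectoryLoader per extension and concatenate the results. A minimal sketch, assuming the same documentation/ folder also contains PDF files:
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader

md_loader = DirectoryLoader("documentation/", glob="**/*.md", loader_cls=TextLoader)
pdf_loader = DirectoryLoader("documentation/", glob="**/*.pdf", loader_cls=PyPDFLoader)

# Combine both loaders' output into a single list for downstream processing
all_docs = md_loader.load() + pdf_loader.load()
print(f"Loaded {len(all_docs)} documents in total.")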
Now that we have our data loaded into Document objects, we face a new challenge. A single document, such as a full chapter from a PDF or a long web article, can easily exceed the context window of most language models. To manage this, we must break these large documents into smaller, semantically coherent pieces. This process, known as text splitting, is the focus of our next section.