The foundation of any RAG system is the data it draws upon. Before an LLM can reason over your information, that information must be loaded and structured in a way the system can understand. This initial step, data ingestion, is handled by LangChain's DocumentLoader components. They provide a standardized way to import data from a multitude of sources, transforming raw content into a uniform format that the rest of the framework can process.
At the heart of LangChain's data handling is the Document object. Think of it as a standardized container for a piece of text and its associated information. Each Document object has two main attributes:
page_content: A string that holds the actual text content.
metadata: A Python dictionary that contains supplementary information about the content, such as its source, page number, or creation date. This metadata is extremely useful for filtering, tracking provenance, and providing citations in your final application.
Here is how you might create a Document object manually:
from langchain_core.documents import Document
doc = Document(
    page_content="LangChain provides a standard interface for document loaders.",
    metadata={"source": "internal_docs/intro.md", "chapter": 1}
)
print(doc.page_content)
# LangChain provides a standard interface for document loaders.
print(doc.metadata)
# {'source': 'internal_docs/intro.md', 'chapter': 1}
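Because metadata is an ordinary dictionary, you can filter a collection of Document objects with plain Python. A minimal sketch, using a small hypothetical list built the same way as the example above:
documents = [
    doc,
    Document(
        page_content="Text splitters break documents into smaller chunks.",
        metadata={"source": "internal_docs/splitting.md", "chapter": 2}
    )
]
# Keep only the documents from chapter 1
chapter_one = [d for d in documents if d.metadata.get("chapter") == 1]
print(len(chapter_one))
# 1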
While you can create Document objects by hand, the real utility comes from using DocumentLoaders to generate them automatically from your data sources. A DocumentLoader is an object designed to fetch data from a source and convert it into a list of Document objects.
The document loader acts as a unified interface, converting various data sources into a standardized list of Document objects for downstream processing.
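Because every loader returns the same type, code that consumes documents does not need to know where they came from. As a minimal sketch (the ingest helper below is a hypothetical name, not a LangChain class), any of the loaders covered in this section could be passed to it:
from langchain_core.documents import Document

def ingest(loader) -> list[Document]:
    # Every DocumentLoader exposes .load(), which returns a list of Document
    # objects, so TextLoader, PyPDFLoader, WebBaseLoader, and others are
    # interchangeable from the caller's point of view.
    return loader.load()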
LangChain offers a wide array of document loaders, each tailored for a specific data source. Let's examine a few of the most frequently used ones.
The simplest case is loading data from a plain .txt file. For this, we use the TextLoader.
First, let's create a sample file.
# In your terminal
echo "This is the first sentence. This is the second." > sample.txt
Now, we can use TextLoader in Python to load it. The .load() method performs the operation and returns a list of Document objects. For TextLoader, this is typically a list with a single Document containing the entire file's content.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("sample.txt")
docs = loader.load()
print(len(docs))
# 1
print(docs[0].page_content)
# This is the first sentence. This is the second.
print(docs[0].metadata)
# {'source': 'sample.txt'}
A common requirement is to ingest data from PDF files. The PyPDFLoader integration handles this by loading a PDF and splitting its content by page, creating one Document object for each page. This is a sensible default, as it automatically preserves page boundaries.
Assuming you have a PDF file named product_manual.pdf:
# You may need to install the pypdf library: pip install pypdf
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("product_manual.pdf")
pages = loader.load()
# Assuming the PDF has multiple pages
print(f"Loaded {len(pages)} pages from the PDF.")
# Inspect the first page's document
first_page_doc = pages[0]
print(first_page_doc.page_content[:200]) # Print first 200 characters
# ... content of the first page ...
print(first_page_doc.metadata)
# {'source': 'product_manual.pdf', 'page': 0}
Notice how the metadata now includes a page key, which can be invaluable for referencing the original source.
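For example, you could assemble a simple citation string from that metadata when surfacing an answer. A minimal sketch, reusing the pages list loaded above (pypdf numbers pages from 0, so we add 1 for display):
# Build a human-readable citation from the loader-provided metadata
citation = f"{first_page_doc.metadata['source']}, page {first_page_doc.metadata['page'] + 1}"
print(citation)
# product_manual.pdf, page 1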
To load content directly from a URL, you can use the WebBaseLoader. It fetches the HTML from the given URL, parses it, and extracts the text content into a Document. This is an efficient way to pull in online articles, blog posts, or documentation.
# You may need to install BeautifulSoup: pip install beautifulsoup4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://www.oreilly.com/about/")
docs = loader.load()
print(docs[0].page_content[:500])
# ... a clean text version of the webpage's content ...
print(docs[0].metadata)
# {'source': 'https://www.oreilly.com/about/', 'title': 'About O’Reilly - O’Reilly'}
The metadata for WebBaseLoader typically includes the source URL and the page title, which allows you to reference the original content.
In many applications, you'll need to load all documents from a directory. The DirectoryLoader provides a convenient way to do this. You specify a path and a glob pattern to select which files to load, and you can also specify which loader class should be used for the matched files.
For example, to load all Markdown files (.md) from a documentation/ folder using the TextLoader:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Assume documentation/ has multiple .md files
loader = DirectoryLoader(
    path="documentation/",
    glob="**/*.md",  # Load all .md files in all subdirectories
    loader_cls=TextLoader,  # Use TextLoader for these files
    show_progress=True  # Display a progress bar
)
docs = loader.load()
print(f"Loaded {len(docs)} documents from the directory.")
This pattern of combining a DirectoryLoader with a specific file loader is a powerful and efficient method for bulk data ingestion.
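If a folder mixes file types, one approach is to run a separate DirectoryLoader per extension and concatenate the results. A minimal sketch, assuming the same documentation/ folder also contains PDF files:
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader

md_loader = DirectoryLoader("documentation/", glob="**/*.md", loader_cls=TextLoader)
pdf_loader = DirectoryLoader("documentation/", glob="**/*.pdf", loader_cls=PyPDFLoader)

# Combine both loaders' output into a single list for downstream processing
all_docs = md_loader.load() + pdf_loader.load()
print(f"Loaded {len(all_docs)} documents in total.")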
Now that we have our data loaded into Document objects, we face a new challenge. A single document, such as a full chapter from a PDF or a long web article, can easily exceed the context window of most language models. To manage this, we must break these large documents into smaller, semantically coherent pieces. This process, known as text splitting, is the focus of our next section.