When working with LLM applications, documents are often represented as Document objects, which can contain a substantial amount of text. Directly feeding a large document to an LLM for question-answering is impractical for two primary reasons. First, it will likely exceed the model's context window limit. Second, providing a large, unfocused block of text makes it difficult for the model to locate the specific information needed to answer a query.
The solution is to break down large documents into smaller, semantically meaningful chunks. This process, known as text splitting or chunking, is a significant step in preparing data for a RAG system. The goal is to create text fragments that are small enough to be efficiently processed by an embedding model yet large enough to retain their original context.
Choosing the right chunk size involves a delicate balance. If chunks are too small, the semantic context can be lost. For example, a chunk containing only the phrase "it was a major breakthrough" is meaningless without the preceding sentences that explain what "it" refers to. On the other hand, if chunks are too large, they may contain irrelevant information that creates noise during retrieval, making it harder for the vector search to identify the most relevant passage. This is related to the "lost in the middle" problem, where a model tends to overlook important information buried in the middle of a long, noisy block of text.
Different chunking strategies for the same text. Small chunks can fragment meaning, while large chunks can introduce noise. The objective is to find a balance that preserves semantic context.
The most versatile and widely recommended method for text splitting in LangChain is the RecursiveCharacterTextSplitter. Instead of splitting on a single fixed character, this splitter works with a prioritized list of separators. It attempts to split the text using the first separator in the list (by default, ["\n\n", "\n", " ", ""]). If the resulting chunks are still too large, it moves to the next separator, and so on.
This hierarchical approach naturally aligns with the structure of written text. It tries to keep paragraphs together first, then sentences, and finally words, ensuring the resulting chunks are as semantically coherent as possible.
Two important parameters control its behavior:
- chunk_size: Defines the maximum length of each chunk, measured in characters.
- chunk_overlap: Specifies the number of characters to overlap between adjacent chunks. This is a valuable feature that helps maintain continuity of context. If a sentence is cut off at the end of one chunk, the overlap ensures it is completed in the next, reducing the chance of losing relationships between sentences.

Here is how you can use it in practice:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Some long text loaded from a document
long_text = """LangChain provides a comprehensive framework for building LLM applications. It simplifies the process of chaining models, managing prompts, and connecting to data sources.
One of its main components is the TextSplitter. This utility is essential for Retrieval Augmented Generation (RAG), as it breaks down large documents into manageable chunks. These chunks are then embedded and stored in a vector database for efficient retrieval. The choice of chunk size and overlap is significant for application performance."""
# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20
)
# Split the document
chunks = text_splitter.split_text(long_text)
# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
Running this code produces the following output, demonstrating how the text is divided while the overlap preserves context at the boundaries.
--- Chunk 1 ---
LangChain provides a comprehensive framework for building LLM applications. It simplifies the process of chaining models, managing prompts, and connecting
--- Chunk 2 ---
connecting to data sources.
One of its main components is the TextSplitter. This utility is essential for Retrieval Augmented Generation (RAG), as it
--- Chunk 3 ---
it breaks down large documents into manageable chunks. These chunks are then embedded and stored in a vector database for efficient retrieval. The choice
--- Chunk 4 ---
choice of chunk size and overlap is significant for application performance.
The overlap is clearly visible; "connecting" appears at the end of the first chunk and the start of the second, ensuring the full phrase is available for context.
The optimal chunk_size depends on the specifics of your data and the embedding model you intend to use. Most embedding models have an input token limit (e.g., 512 or 1024 tokens), and you want your chunks to comfortably fit within that limit. A character count does not map directly to a token count, but a common heuristic is that one token is approximately four characters in English.
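If you want an exact token count rather than the four-characters-per-token heuristic, you can tokenize each chunk directly. The snippet below is a minimal sketch, assuming the tiktoken package is installed and reusing the chunks list from the example above; cl100k_base is the encoding used by OpenAI's recent embedding models.

import tiktoken  # assumed to be available in your environment

# cl100k_base is used by models such as text-embedding-3-small and text-embedding-3-large
encoding = tiktoken.get_encoding("cl100k_base")

for i, chunk in enumerate(chunks):
    n_tokens = len(encoding.encode(chunk))
    print(f"Chunk {i+1}: {len(chunk)} characters, {n_tokens} tokens")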
A good starting point for many applications is a chunk_size of around 1000 characters and a chunk_overlap of 200. These values provide a solid balance, but you should experiment to find what works best for your use case. Visualizing the distribution of chunk lengths after splitting can help confirm that your configuration is behaving as expected.
A histogram showing the lengths of the four chunks generated in the example: most are near the chunk_size limit of 150 characters, with the final chunk being shorter.
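A quick way to produce such a histogram yourself is sketched below, assuming matplotlib is installed and reusing the chunks list from the example above:

import matplotlib.pyplot as plt  # assumed to be available in your environment

# Measure each chunk and plot the distribution of lengths
lengths = [len(chunk) for chunk in chunks]
plt.hist(lengths, bins=10)
plt.axvline(150, color="red", linestyle="--", label="chunk_size limit")
plt.xlabel("Chunk length (characters)")
plt.ylabel("Number of chunks")
plt.legend()
plt.show()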
While RecursiveCharacterTextSplitter is a strong default, LangChain offers other splitters tailored for different needs:
- CharacterTextSplitter: Splits text on a single fixed separator, by default \n. It is fast but less effective at preserving semantic boundaries compared to the recursive approach.
- TokenTextSplitter: Splits text by token count rather than character count, using a tokenizer such as tiktoken from OpenAI.

For example, using TokenTextSplitter would look like this:
from langchain_text_splitters import TokenTextSplitter
# TokenTextSplitter measures length in tokens and requires the tiktoken package
token_splitter = TokenTextSplitter(
    chunk_size=50,        # Size in tokens
    chunk_overlap=10,     # Overlap in tokens
    model_name="gpt-4o"   # Specify the model to ensure accurate token counting
)
token_chunks = token_splitter.split_text(long_text)
This approach provides more predictable chunk sizes in terms of what the model will process. However, for most general-purpose RAG systems, the RecursiveCharacterTextSplitter offers an effective balance of performance and simplicity.
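In a full RAG pipeline you typically start from Document objects produced by a loader rather than from raw strings. Every splitter also offers a split_documents method, which applies the same chunking logic while carrying each document's metadata over to its chunks. Below is a minimal sketch reusing the text_splitter configured earlier; the source value in the metadata is purely illustrative.

from langchain_core.documents import Document

# Wrap the raw text in a Document, as a document loader would
docs = [Document(page_content=long_text, metadata={"source": "intro.txt"})]

doc_chunks = text_splitter.split_documents(docs)
print(doc_chunks[0].metadata)  # each chunk keeps the original document's metadata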
With our documents now processed into well-defined, semantically meaningful chunks, the next step is to convert them into a format that a machine can use for similarity comparison: numerical vectors, or embeddings.