When working with LLM applications, documents are often represented as Document objects, which can contain a substantial amount of text. Directly feeding a large document to an LLM for question-answering is impractical for two primary reasons. First, it will likely exceed the model's context window limit. Second, providing a large, unfocused block of text makes it difficult for the model to locate the specific information needed to answer a query.
The solution is to break down large documents into smaller, semantically meaningful chunks. This process, known as text splitting or chunking, is a significant step in preparing data for a RAG system. The goal is to create text fragments that are small enough to be efficiently processed by an embedding model yet large enough to retain their original context.
Choosing the right chunk size involves a delicate balance. If chunks are too small, the semantic context can be lost. For example, a chunk containing only the phrase "it was a major breakthrough" is meaningless without the preceding sentences that explain what "it" refers to. On the other hand, if chunks are too large, they may contain irrelevant information that creates noise during retrieval, making it harder for the vector search to identify the most relevant passage. This is related to the "lost in the middle" problem, where a model tends to overlook important information buried in the middle of a long, noisy block of text.
Different chunking strategies for the same text. Small chunks can fragment meaning, while large chunks can introduce noise. The objective is to find a balance that preserves semantic context.
The most versatile and widely recommended method for text splitting in LangChain is the RecursiveCharacterTextSplitter. Instead of splitting on a single fixed character, this splitter works with a prioritized list of separators. It attempts to split the text using the first separator in the list (by default, ["\n\n", "\n", " ", ""]). If the resulting chunks are still too large, it moves to the next separator, and so on.
This hierarchical approach naturally aligns with the structure of written text. It tries to keep paragraphs together first, then sentences, and finally words, ensuring the resulting chunks are as semantically coherent as possible.
Two important parameters control its behavior:
- chunk_size: Defines the maximum length of each chunk, measured in characters.
- chunk_overlap: Specifies the number of characters to overlap between adjacent chunks. This is a valuable feature that helps maintain continuity of context. If a sentence is cut off at the end of one chunk, the overlap ensures it is completed in the next, reducing the chance of losing relationships between sentences.

Here is how you can use it in practice:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Some long text loaded from a document
long_text = """LangChain provides a comprehensive framework for building LLM applications. It simplifies the process of chaining models, managing prompts, and connecting to data sources.
One of its main components is the TextSplitter. This utility is essential for Retrieval Augmented Generation (RAG), as it breaks down large documents into manageable chunks. These chunks are then embedded and stored in a vector database for efficient retrieval. The choice of chunk size and overlap is significant for application performance."""
# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20
)
# Split the document
chunks = text_splitter.split_text(long_text)
# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
Running this code produces the following output, demonstrating how the text is divided while the overlap preserves context at the boundaries.
--- Chunk 1 ---
LangChain provides a comprehensive framework for building LLM applications. It simplifies the process of chaining models, managing prompts, and connecting
--- Chunk 2 ---
connecting to data sources.
One of its main components is the TextSplitter. This utility is essential for Retrieval Augmented Generation (RAG), as it
--- Chunk 3 ---
it breaks down large documents into manageable chunks. These chunks are then embedded and stored in a vector database for efficient retrieval. The choice
--- Chunk 4 ---
choice of chunk size and overlap is significant for application performance.
The overlap is clearly visible; "connecting" appears at the end of the first chunk and the start of the second, ensuring the full phrase is available for context.
The optimal chunk_size depends on the specifics of your data and the embedding model you intend to use. Most embedding models have an input token limit (e.g., 512 or 1024 tokens), and you want your chunks to comfortably fit within that limit. A character count does not map directly to a token count, but a common heuristic is that one token is approximately four characters in English.
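If you want an exact token count rather than the four-characters-per-token heuristic, you can tokenize each chunk directly. The snippet below is a minimal sketch, assuming the tiktoken package is installed and reusing the chunks list from the example above; cl100k_base is the encoding used by OpenAI's recent embedding models.

import tiktoken  # assumed to be available in your environment

# cl100k_base is used by models such as text-embedding-3-small and text-embedding-3-large
encoding = tiktoken.get_encoding("cl100k_base")

for i, chunk in enumerate(chunks):
    n_tokens = len(encoding.encode(chunk))
    print(f"Chunk {i+1}: {len(chunk)} characters, {n_tokens} tokens")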
A good starting point for many applications is a chunk_size of around 1000 characters and a chunk_overlap of 200. These values provide a solid balance, but you should experiment to find what works best for your use case. Visualizing the distribution of chunk lengths after splitting can help confirm that your configuration is behaving as expected.
A histogram showing the lengths of the four chunks generated in the example: most are near the chunk_size limit of 150 characters, with the final chunk being shorter.
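A quick way to produce such a histogram yourself is sketched below, assuming matplotlib is installed and reusing the chunks list from the example above:

import matplotlib.pyplot as plt  # assumed to be available in your environment

# Measure each chunk and plot the distribution of lengths
lengths = [len(chunk) for chunk in chunks]
plt.hist(lengths, bins=10)
plt.axvline(150, color="red", linestyle="--", label="chunk_size limit")
plt.xlabel("Chunk length (characters)")
plt.ylabel("Number of chunks")
plt.legend()
plt.show()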
While RecursiveCharacterTextSplitter is a strong default, LangChain offers other splitters tailored for different needs:
- CharacterTextSplitter: Splits text on a single fixed separator, by default \n. It is fast but less effective at preserving semantic boundaries compared to the recursive approach.
- TokenTextSplitter: Splits text by token count rather than character count, using a tokenizer such as tiktoken from OpenAI.

For example, using TokenTextSplitter would look like this:
from langchain_text_splitters import TokenTextSplitter
# TokenTextSplitter measures length in tokens and requires the tiktoken package
token_splitter = TokenTextSplitter(
    chunk_size=50,        # Size in tokens
    chunk_overlap=10,     # Overlap in tokens
    model_name="gpt-4o"   # Specify the model to ensure accurate token counting
)
token_chunks = token_splitter.split_text(long_text)
This approach provides more predictable chunk sizes in terms of what the model will process. However, for most general-purpose RAG systems, the RecursiveCharacterTextSplitter offers an effective balance of performance and simplicity.
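In a full RAG pipeline you typically start from Document objects produced by a loader rather than from raw strings. Every splitter also offers a split_documents method, which applies the same chunking logic while carrying each document's metadata over to its chunks. Below is a minimal sketch reusing the text_splitter configured earlier; the source value in the metadata is purely illustrative.

from langchain_core.documents import Document

# Wrap the raw text in a Document, as a document loader would
docs = [Document(page_content=long_text, metadata={"source": "intro.txt"})]

doc_chunks = text_splitter.split_documents(docs)
print(doc_chunks[0].metadata)  # each chunk keeps the original document's metadata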
With our documents now processed into well-defined, semantically meaningful chunks, the next step is to convert them into a format that a machine can use for similarity comparison: numerical vectors, or embeddings.