Breaking down documents into manageable pieces is a significant step in preparing data for retrieval. This process, known as chunking, is fundamental to building effective RAG systems. The reason for it lies in a core limitation of all Large Language Models: their finite context window.
Think of the context window as the model's working memory. It's the total amount of information, including your instructions, questions, and any provided documents, that the model can consider at one time. This limit isn't measured in pages or words but in tokens, which are pieces of words. A model with an 8,000-token context window cannot process a 20,000-token document in a single request. Attempting to do so would result in an error, and even if it did fit, processing a very large context is often slow and expensive.
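As a rough back-of-the-envelope illustration (assuming the common approximation of about four characters per token for English prose; real tokenizers vary by model), you can estimate whether a document fits in a context window:
def estimate_tokens(text: str) -> int:
    # Rough heuristic: English prose averages roughly 4 characters per token.
    return len(text) // 4

context_window = 8_000
document = "Retrieval-augmented generation combines search with LLMs. " * 1_500
estimated = estimate_tokens(document)
print(f"Estimated tokens: {estimated}")
print(f"Fits in an 8,000-token window: {estimated <= context_window}")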
This limitation presents us with a challenge. If we can't give the model the entire document, how can we ensure it has the right information to answer a question? This is where chunking becomes essential. By splitting a large document into a series of smaller, indexed chunks, we transform the problem from "read this entire book" to "find the most relevant page."
The effectiveness of a RAG system depends heavily on the quality of its chunks. This creates a "Goldilocks" problem where the size of the chunks must be just right.
If chunks are too large: They might contain too much information that is irrelevant to the user's query. This noise can dilute the important facts, making it harder for the LLM to locate the exact answer. Imagine asking for a specific ingredient and being handed an entire cookbook. The information is there, but it's buried.
If chunks are too small: They may not contain enough surrounding context to be meaningful. A single sentence might be ambiguous without the sentences that precede and follow it. This can cause the LLM to misinterpret the information or fail to answer a question because the necessary details were split across different chunks.
The goal is to create chunks that are small enough for precise retrieval but large enough to be semantically complete and provide sufficient context for the LLM.
The RAG workflow. A large document is first split into smaller chunks. When a user asks a query, a retrieval system finds the most relevant chunk, which is then passed to the LLM as context to generate an answer.
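To make this pipeline concrete, here is a minimal end-to-end sketch. It uses simple word overlap as a stand-in for a real embedding-based retriever, and call_llm is a hypothetical placeholder rather than part of the toolkit:
def overlap_score(chunk: str, query: str) -> int:
    # Toy relevance score: how many query words also appear in the chunk.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_best_chunk(chunks: list[str], query: str) -> str:
    # Return the chunk with the highest overlap with the query.
    return max(chunks, key=lambda chunk: overlap_score(chunk, query))

chunks = [
    "Machine learning enables computers to learn from data.",
    "Paris is the capital of France.",
]
query = "What does machine learning enable computers to do?"
context = retrieve_best_chunk(chunks, query)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = call_llm(prompt)  # hypothetical LLM call, not part of the toolkit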
With the rationale established, let's explore a basic implementation. The simplest way to chunk text is to split it based on a fixed number of characters. While this method doesn't respect sentence or paragraph boundaries, it clearly demonstrates the core principle of division.
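Conceptually, fixed-size splitting is nothing more than slicing the string at regular offsets. A minimal sketch in plain Python, shown only to illustrate the mechanics, looks like this:
def split_fixed(text: str, size: int) -> list[str]:
    # Slice the text into consecutive pieces of at most `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]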
The toolkit provides simple_chunker for this purpose. Let's take a sample text and split it into 100-character chunks.
from kerb.chunk import simple_chunker
text = """
Artificial intelligence (AI) is transforming technology.
Machine learning enables computers to learn from data without explicit programming.
Natural language processing allows machines to understand and generate human language.
""".strip()
# Split into chunks of approximately 100 characters
chunks = simple_chunker(text, chunk_size=100)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} ({len(chunk)} chars): '{chunk}'")
Running this code produces the following output:
Chunk 1 (100 chars): 'Artificial intelligence (AI) is transforming technology.
Machine learning enables compute'
Chunk 2 (100 chars): 'rs to learn from data without explicit programming.
Natural language processing allows machines to unde'
Chunk 3 (62 chars): 'rstand and generate human language.'
As you can see, this approach is straightforward but crude. It splits the text right in the middle of words like "computers" and "understand". This loss of context at the chunk boundaries can harm retrieval quality. A query about "understanding language" might fail to match the third chunk effectively because the word "understand" is incomplete.
This basic method highlights the need for more intelligent chunking strategies, which we will explore in the next section. By respecting the natural structure of the text, such as sentences and paragraphs, we can create more meaningful chunks that preserve context and lead to better RAG performance.
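As a small preview, a sentence-aware splitter can be sketched in plain Python. This illustrative version (not the toolkit's implementation) splits on sentence-ending periods and greedily packs whole sentences up to a size limit; production chunkers also handle abbreviations, overlap, and other edge cases:
def sentence_chunker(text: str, chunk_size: int = 100) -> list[str]:
    # Greedily pack whole sentences into chunks of roughly chunk_size characters.
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        if not sentence.endswith("."):
            sentence += "."
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
Applied to the sample text above, this yields one chunk per complete sentence, so no words are split at chunk boundaries.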