As established, large documents pose a challenge for both efficient retrieval and the limited context windows of Large Language Models (LLMs). Chunking breaks down these documents into smaller, digestible pieces. The most direct approach is fixed-size chunking, where we split the text based purely on length, ignoring the actual content structure for simplicity.
This strategy comes in two main flavors: character-based and token-based splitting.
Character-based splitting is the simplest method. You define a chunk size, say, 1000 characters, and simply split the document text every 1000 characters.
Example: If your text is "The quick brown fox jumps over the lazy dog." and your chunk size is 20 characters, the splits fall at fixed positions regardless of word boundaries:

Chunk 1: "The quick brown fox "
Chunk 2: "jumps over the lazy "
Chunk 3: "dog."
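A minimal sketch of this approach in Python (the helper name `split_by_characters` is ours, chosen for illustration):

```python
def split_by_characters(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_by_characters("The quick brown fox jumps over the lazy dog.", 20)
# ['The quick brown fox ', 'jumps over the lazy ', 'dog.']
```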
Advantages:
- Trivial to implement, fast, and dependency-free.
- Produces chunks of a predictable, uniform size.

Disadvantages:
- Ignores content entirely: words and sentences are routinely cut mid-way, as in the example above.
- Character counts map only loosely to the token limits that embedding models and LLMs actually enforce.
A more common and generally preferred fixed-size approach aligns better with how LLMs process information: splitting by token count. Tokens are the basic units of text (words, sub-words, punctuation) that language models work with.
To do this, you need a tokenizer, typically the one associated with the embedding model or the final LLM you plan to use. You specify a chunk size in terms of the number of tokens (e.g., 512 tokens). The text is first tokenized, and then split into chunks containing the desired number of tokens.
Example: Using a tokenizer, "The quick brown fox jumps over the lazy dog." might become tokens like [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog] [.]. If the chunk size is 5 tokens:

Chunk 1: [The] [quick] [brown] [fox] [jumps] -> Text: "The quick brown fox jumps"
Chunk 2: [over] [the] [lazy] [dog] [.] -> Text: "over the lazy dog."

Advantages:
- Aligns chunk boundaries with the units LLMs and embedding models actually process, so chunks reliably fit model input limits.
- Chunk sizes are consistent from the model's perspective, regardless of character-level variation.

Disadvantages:
- Requires running a tokenizer, adding a dependency and some processing overhead.
- Still ignores semantic structure: sentences and paragraphs can be cut mid-way, just as with character splitting.
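As a sketch, token-based splitting with the tiktoken library might look like this (the `cl100k_base` encoding is our assumption; substitute the tokenizer that matches your embedding model):

```python
import tiktoken  # assumes: pip install tiktoken

def split_by_tokens(text: str, chunk_size: int,
                    encoding_name: str = "cl100k_base") -> list[str]:
    """Tokenize the text, slice the token sequence every chunk_size tokens,
    and decode each slice back into a text chunk."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

chunks = split_by_tokens("The quick brown fox jumps over the lazy dog.", chunk_size=5)
# Two chunks, similar to the example above; exact boundaries depend on the tokenizer.
```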
Regardless of whether you split by characters or tokens, simply dividing the text into consecutive, non-overlapping chunks can be problematic. Imagine a sentence describing a specific concept starting near the end of one chunk and finishing at the beginning of the next. A query related to that concept might only match one of the chunks strongly, potentially missing the full context.
To mitigate this, we introduce chunk overlap. This means that consecutive chunks share some content at their boundaries. For instance, if you have a chunk size of 512 tokens, you might specify an overlap of 50 tokens. Chunk 1 would contain tokens 1-512, Chunk 2 would contain tokens 463-974 (512 tokens starting 50 tokens before the end of Chunk 1), Chunk 3 would contain tokens 925-1436, and so on.
Original Text: [----- Section A -----][----- Section B -----][----- Section C -----][----- Section D -----]
No Overlap Chunking (Chunk Size = 2 Sections):
Chunk 1: [----- Section A -----][----- Section B -----]
Chunk 2: [----- Section C -----][----- Section D -----]
*Risk: Information spanning the end of B and start of C might be lost.*
Chunking with Overlap (Chunk Size = 2 Sections, Overlap = 1 Section):
Chunk 1: [----- Section A -----][----- Section B -----]
Chunk 2: [----- Section B -----][----- Section C -----]
Chunk 3: [----- Section C -----][----- Section D -----]
*Benefit: Information spanning B and C is fully contained within Chunk 2.*
Overlap helps ensure that semantic context flowing across chunk boundaries is preserved within at least one chunk, increasing the likelihood that retrieval will find all relevant information for a given query.
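In code, overlap amounts to a sliding window whose stride is the chunk size minus the overlap. A minimal sketch over a token sequence (the function name is ours):

```python
def split_with_overlap(tokens: list[int], chunk_size: int,
                       overlap: int) -> list[list[int]]:
    """Slide a chunk_size-token window over the sequence, advancing
    (chunk_size - overlap) tokens each step; the final window may be shorter."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
```

With chunk_size=512 and overlap=50, the stride is 462, which reproduces the ranges from the example above: tokens 1-512, 463-974, 925-1436, and so on.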
Selecting the optimal chunk_size and chunk_overlap is more of an art than a science and often requires experimentation. The ideal values depend on your specific documents (average sentence length, paragraph structure), the embedding model used (its optimal input length), and the LLM's context window size. You'll likely need to test different combinations and evaluate the impact on retrieval quality (covered in Chapter 6).
Most RAG frameworks and libraries (like LangChain or LlamaIndex) provide convenient functions for fixed-size chunking with overlap, handling both character and token splitting. When using token splitting, ensure you configure the splitter with the correct tokenizer for your chosen embedding model.
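For instance, here is a sketch using LangChain's TokenTextSplitter (assuming the langchain-text-splitters package and a tiktoken encoding; verify the API against the version you install):

```python
# assumes: pip install langchain-text-splitters tiktoken
from langchain_text_splitters import TokenTextSplitter

document_text = "Your long document text goes here..."  # placeholder input

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # tiktoken encoding; match it to your embedding model
    chunk_size=512,               # tokens per chunk
    chunk_overlap=50,             # tokens shared by consecutive chunks
)
chunks = splitter.split_text(document_text)
```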
While simple and often effective as a starting point, fixed-size chunking fundamentally ignores the natural structure of the document (paragraphs, sections, headings). This limitation motivates the exploration of more sophisticated, content-aware chunking methods, which we will discuss next.