Partitioning loaded data into smaller, manageable pieces is an essential step in preparing it for embedding and retrieval. The method used to split text can significantly influence the performance of a RAG system, and different strategies trade off simplicity, semantic coherence, and computational overhead. This section explores several common and effective chunking strategies.
The most direct approach is fixed-size chunking. This method splits a document into segments of a predetermined length, measured in characters. It is simple to implement and computationally efficient.
However, its main disadvantage is that it ignores the structure and meaning of the text. This can produce chunks that end abruptly mid-sentence or even mid-word, disrupting the semantic integrity of the information presented to the language model.
You can implement a basic fixed-size chunker using the chunk_text utility.
from kerb.chunk import chunk_text
text = "Artificial intelligence is transforming industries. Machine learning enables computers to learn from data. Natural language processing helps them understand text."
# Split into 80-character chunks with no overlap
chunks = chunk_text(text, chunk_size=80, overlap=0)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: '{chunk}'")
The output demonstrates the problem with this naive approach. Notice how the word "computers" is split mid-word, leaving the second chunk to begin with a meaningless fragment.
Chunk 1: 'Artificial intelligence is transforming industries. Machine learning enables compute'
Chunk 2: 'rs to learn from data. Natural language processing helps them understand text.'
To mitigate this loss of context, you can introduce an overlap between consecutive chunks. An overlap ensures that a small portion of text from the end of one chunk is repeated at the beginning of the next. This helps preserve the local context around the split point, which is beneficial for the embedding model.
from kerb.chunk import chunk_text
text = "Artificial intelligence is transforming industries. Machine learning enables computers to learn from data. Natural language processing helps them understand text."
# Split into 80-character chunks with a 15-character overlap
chunks_with_overlap = chunk_text(text, chunk_size=80, overlap=15)
for i, chunk in enumerate(chunks_with_overlap):
    print(f"Chunk {i+1}: '{chunk}'")
The output now shows how the end of the first chunk overlaps with the beginning of the second, providing better context for a model processing either chunk.
Chunk 1: 'Artificial intelligence is transforming industries. Machine learning enables compute'
Chunk 2: 'enables computers to learn from data. Natural language processing helps them underst'
Chunk 3: 'elps them understand text.'
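Under the hood, the mechanics are simple: each new chunk starts chunk_size - overlap characters after the previous one. The following minimal sketch illustrates the idea (for illustration only; it is not kerb's actual implementation and skips edge-case handling):

def simple_chunk(text, chunk_size, overlap=0):
    # Step forward by (chunk_size - overlap) characters so that
    # consecutive chunks share `overlap` characters at the boundary
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]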
While overlap helps, fixed-size chunking remains a blunt instrument. For most applications, a more structure-aware approach is preferable.
A more sophisticated method is recursive character splitting. This strategy attempts to preserve semantic boundaries by splitting the text based on a hierarchical list of separators. It starts by trying to split the text by the highest-priority separator (like double newlines, which often separate paragraphs). If the resulting chunks are still too large, it recursively applies the next separator in the hierarchy (single newlines, sentences, words, and finally individual characters) until all chunks are within the desired size.
The recursive splitting process attempts to use the largest possible semantic separator before moving to smaller ones.
This top-down approach is effective because it prioritizes keeping high-level semantic units like paragraphs together. The default separator hierarchy is ["\n\n", "\n", ". ", " ", ""].
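To make the algorithm concrete, here is a simplified from-scratch illustration. This is not kerb's actual implementation: production splitters also merge adjacent small pieces back together up to the size limit and preserve the separators themselves, both of which this sketch omits for clarity.

def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    # Base case: the text already fits within the limit
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    # Last resort: no meaningful separator left, hard-split by characters
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece still too large: recurse with the next separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
        else:
            chunks.append(piece)
    return chunks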
You can use the recursive_chunker function to apply this strategy.
from kerb.chunk import recursive_chunker
multi_paragraph_text = """
Retrieval-Augmented Generation (RAG) is a powerful technique for LLM applications. It combines the benefits of retrieval systems with generative models.
The process works in several steps. First, documents are chunked and embedded. Then, relevant chunks are retrieved based on the query. Finally, the LLM generates a response using the retrieved context.
Vector databases play an important role in RAG systems.
""".strip()
# Create chunks that respect paragraph boundaries where possible
chunks = recursive_chunker(multi_paragraph_text, chunk_size=200)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(f" '{chunk}'\n")
The output shows how this method successfully keeps the first paragraph intact as a single chunk because it fits within the 200-character limit. The second, longer paragraph is split at a sentence boundary.
Chunk 1 (148 chars):
'Retrieval-Augmented Generation (RAG) is a powerful technique for LLM applications. It combines the benefits of retrieval systems with generative models.'
Chunk 2 (198 chars):
'The process works in several steps. First, documents are chunked and embedded. Then, relevant chunks are retrieved based on the query. Finally, the LLM generates a response using the retrieved context.'
Chunk 3 (62 chars):
'Vector databases play an important role in RAG systems.'
For most unstructured or semi-structured text, recursive character splitting is a reliable and recommended default strategy.
For RAG systems, especially in question-answering scenarios, ensuring that each chunk contains complete sentences is highly beneficial. Sentence-based chunking first splits the entire document into individual sentences and then groups them into chunks. This method guarantees that no sentence is ever broken across two chunks.
Like fixed-size chunking, this method also benefits from an overlap, but at the sentence level. For example, an overlap of one sentence means the last sentence of chunk N becomes the first sentence of chunk N+1.
The sentence_window_chunker function implements this strategy: you specify both the window size and the overlap as a number of sentences.
from kerb.chunk import sentence_window_chunker
article_text = "Vector databases are essential for RAG. They store embeddings and enable fast search. Pinecone is a popular managed solution. Weaviate is a flexible open-source alternative. Chroma is great for local development. Choosing the right one depends on your needs."
# Group text into chunks of 3 sentences, with a 1-sentence overlap
chunks = sentence_window_chunker(
    article_text,
    window_sentences=3,
    overlap_sentences=1
)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:")
    print(f" '{chunk}'\n")
The output clearly illustrates the windowing and overlap behavior.
Chunk 1:
'Vector databases are essential for RAG. They store embeddings and enable fast search. Pinecone is a popular managed solution.'
Chunk 2:
'Pinecone is a popular managed solution. Weaviate is a flexible open-source alternative. Chroma is great for local development.'
Chunk 3:
'Chroma is great for local development. Choosing the right one depends on your needs.'
Notice how "Pinecone is a popular managed solution." appears at the end of the first chunk and the beginning of the second. This overlap provides important contextual linkage that improves retrieval quality, as a user's query might semantically match content near the boundary of two chunks.
The right chunking strategy depends on your data and application: fixed-size chunking is simple and fast for prototypes, recursive splitting is a reliable default for prose with paragraph structure, and sentence-window chunking suits question-answering workloads where complete sentences matter.
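You can encode that rule of thumb in a small dispatch helper. The choose_chunker function and the content_type labels below are hypothetical, but the kerb calls match the examples covered above:

from kerb.chunk import chunk_text, recursive_chunker, sentence_window_chunker

def choose_chunker(text, content_type):
    if content_type == "qa":
        # Question answering: never break a sentence across chunks
        return sentence_window_chunker(text, window_sentences=3, overlap_sentences=1)
    if content_type == "prose":
        # Articles and docs: respect paragraph and sentence boundaries
        return recursive_chunker(text, chunk_size=200)
    # Fallback: simple, fast, structure-agnostic
    return chunk_text(text, chunk_size=200, overlap=20)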
After applying a chunking strategy, you will have a set of text chunks ready for further refinement. The next step in our pipeline is to apply text preprocessing to clean and normalize this content, further improving its quality for the embedding model.