5 Chunking Techniques for Retrieval-Augmented Generation (RAG)

By Lea M. on May 20, 2025

Guest Author

Retrieval-Augmented Generation (RAG) systems enhance Large Language Model (LLM) responses by providing relevant external knowledge. A fundamental step in building effective RAG systems is chunking, the process of dividing large documents into smaller, digestible pieces. The quality of these chunks directly influences the relevance of the retrieved context and, consequently, the accuracy and usefulness of the LLM's output.

Getting chunking right means feeding your LLM precisely the information it needs, without overwhelming it or missing important details. This balance is essential for optimizing both performance and cost in LLM applications.

Understanding Chunking in RAG

Chunking involves breaking down extensive text or data sources into smaller segments, or "chunks." These chunks are then typically embedded and stored in a vector database for efficient similarity search. When a user poses a query, the RAG system retrieves the most relevant chunks to provide context for the LLM's generation process.

Without proper chunking, RAG systems face several challenges. LLMs have finite context windows; feeding overly large chunks can exceed these limits or introduce noise, diluting the important information. Conversely, chunks that are too small might lack sufficient context, leading to fragmented or incomplete answers. The goal is to create chunks that are semantically complete yet concise.

Several strategies exist for chunking documents, each with its own set of advantages and disadvantages. The choice of strategy often depends on the document structure, content type, and the specific requirements of the RAG application.

Fixed-Size Chunking

Fixed-size chunking is the most straightforward method. It involves splitting the text into segments of a predetermined length, typically measured in characters or tokens. An overlap between consecutive chunks is often introduced to maintain some contextual continuity.

  • Pros: Simple to implement and computationally inexpensive.
  • Cons: Can arbitrarily cut sentences or ideas, potentially breaking semantic meaning. This can result in chunks that are difficult for the LLM to interpret correctly.

Example Paragraph Illustration:

Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important.

If chunked with chunk_size = 70 characters and overlap_size = 10, the result looks roughly like this (the boundaries and overlaps shown below are approximate, for illustration):

Chunk Content
1 "Artificial intelligence is rapidly changing our daily routines. Machine "
2 "routines. Machine learning, a subset of AI, involves algorithms that l" (Overlapping "routines. Machine ")
3 "orithms that learn from data. Deep learning, a further subset, uses n" (Overlapping "orithms that l")
4 "r subset, uses neural networks with many layers. These technologies ar" (Overlapping "r subset, uses n")
5 "echnologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important." (Overlapping "echnologies ar")

This shows how sentences and ideas can be cut mid-way.

# Example using character count
def fixed_size_char_chunker(text, chunk_size, overlap_size):
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # Last chunk reached; stop
            break
        start += chunk_size - overlap_size  # Step forward, keeping the overlap
    return chunks

# text = "Your very long document text goes here..."
# chunks = fixed_size_char_chunker(text, 200, 20)

For token-based fixed-size chunking, libraries like tiktoken (from OpenAI) are commonly used to count tokens accurately for specific LLMs.
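As a rough sketch of what that can look like (assuming the tiktoken package is installed; the encoding name and sizes below are illustrative choices, not fixed recommendations):

import tiktoken

def fixed_size_token_chunker(text, chunk_size=256, overlap_size=32,
                             encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))  # back to text for embedding/storage
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap_size
    return chunks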

Sentence Splitting

This strategy divides the text based on sentence boundaries. Each chunk consists of one or more complete sentences. This approach generally preserves semantic integrity better than fixed-size chunking.

  • Pros: Maintains sentence structure, leading to more coherent chunks.
  • Cons: Sentence lengths can vary significantly, resulting in inconsistent chunk sizes. Very long sentences might still exceed desired chunk lengths if only one sentence is used per chunk.

Example Paragraph Illustration:

Using the same text: "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important."

The sentences are:

  1. "Artificial intelligence is rapidly changing our daily routines."
  2. "Machine learning, a subset of AI, involves algorithms that learn from data."
  3. "Deep learning, a further subset, uses neural networks with many layers."
  4. "These technologies are applied in various fields, from healthcare to finance."
  5. "Ethical considerations are also very important."

If chunked with sentences_per_chunk = 2, the resulting chunks are:

Chunk Content
1 "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data."
2 "Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance."
3 "Ethical considerations are also very important."

Each chunk contains complete sentences, maintaining readability and some semantic context.

import nltk
nltk.download('punkt', quiet=True)  # ensure the sentence tokenizer model is available
from nltk.tokenize import sent_tokenize

def sentence_chunker(text, sentences_per_chunk=3):
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks

# text = "This is the first sentence. Here is another one. And a third."
# chunks = sentence_chunker(text, sentences_per_chunk=2)

Libraries like spaCy also offer robust sentence segmentation capabilities.
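A minimal spaCy-based variant might look like this (assuming the en_core_web_sm model has been installed):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with sentence segmentation

def spacy_sentence_chunker(text, sentences_per_chunk=3):
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]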

Recursive Chunking

Recursive chunking attempts to divide text using a hierarchical list of separators. It starts with larger separators (e.g., double newlines for paragraphs) and recursively splits the text using smaller separators (e.g., single newlines for line breaks, then spaces) if the resulting chunks are still too large. The aim is to keep semantically related text segments together as much as possible.

  • Pros: Often provides a good balance between maintaining semantic coherence and controlling chunk size. Adapts well to various document structures.
  • Cons: Can be slightly more complex to configure the hierarchy of separators effectively.

Example Paragraph Illustration:

Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important.

Assume chunk_size = 150 characters, separators = ["\n\n", "\n", ". ", " ", ""]. The text has no \n\n or \n. It would first split by ". ". Sentences (and approx. lengths):

  1. "Artificial intelligence is rapidly changing our daily routines" (60)
  2. "Machine learning, a subset of AI, involves algorithms that learn from data" (75)
  3. "Deep learning, a further subset, uses neural networks with many layers" (72)
  4. "These technologies are applied in various fields, from healthcare to finance" (76)
  5. "Ethical considerations are also very important" (46)

The splitter attempts to create chunks near 150 chars, respecting separators:

Chunk Content
1 "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data."
2 "Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance."
3 "Ethical considerations are also very important."

If a segment (e.g., a paragraph from \n\n) were still too large after the initial split, it would be recursively split by ". ", then by " ", etc.

Many modern NLP libraries, such as LangChain, provide implementations of recursive chunkers.

# Using LangChain's RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Target size in characters
    chunk_overlap=50,      # Overlap in characters
    separators=["\n\n", "\n", ". ", " ", ""] # Order matters
)

# long_document_text = "..."
# chunks = text_splitter.split_text(long_document_text)

This method is generally preferred for unstructured text due to its flexibility.

Content-Aware Chunking (Semantic Chunking)

Content-aware, or semantic, chunking uses NLP techniques to identify natural breaks in meaning or topic within the text. Instead of relying on fixed sizes or syntactic delimiters, it analyzes the semantic content to form chunks. This can involve using embedding models to measure the similarity between adjacent sentences or paragraphs; a significant drop in similarity might indicate a good splitting point.

  • Pros: Produces highly coherent chunks that align with the topical structure of the document. This often leads to better retrieval relevance.
  • Cons: Computationally more intensive than simpler methods. Requires access to embedding models and more complex logic.

Example Paragraph Illustration:

The Apollo program achieved its goal of landing humans on the Moon. Key figures included Neil Armstrong and Buzz Aldrin. The Saturn V rocket was essential for these missions. Separately, developments in microbiology during the same era led to new antibiotics. Research into penicillin was particularly impactful.

Semantic chunking would analyze sentence similarity:

  1. "The Apollo program achieved its goal of landing humans on the Moon." (Topic: Apollo missions)
  2. "Key figures included Neil Armstrong and Buzz Aldrin." (Topic: Apollo missions - high similarity to 1)
  3. "The Saturn V rocket was essential for these missions." (Topic: Apollo missions - high similarity to 1 & 2)
  4. "Separately, developments in microbiology during the same era led to new antibiotics." (Topic: Microbiology - low similarity to 1-3, indicating a topic shift)
  5. "Research into penicillin was particularly impactful." (Topic: Microbiology - high similarity to 4)

Resulting chunks based on semantic grouping:

Chunk Topic Content
1 Space Exploration "The Apollo program achieved its goal of landing humans on the Moon. Key figures included Neil Armstrong and Buzz Aldrin. The Saturn V rocket was essential for these missions."
2 Microbiology "Separately, developments in microbiology during the same era led to new antibiotics. Research into penicillin was particularly impactful."

One approach involves calculating cosine similarity between embeddings of consecutive sentences. When the similarity score falls below a certain threshold, a new chunk begins.

Figure: Illustration of how semantic chunking might segment a document based on shifts in topic or meaning, identified by similarity scores between text segments.
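A minimal sketch of this threshold-based approach, assuming the sentence-transformers package and an illustrative threshold of 0.5 (both are assumptions, not fixed choices):

import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

def semantic_chunker(text, threshold=0.5, model_name="all-MiniLM-L6-v2"):
    sentences = sent_tokenize(text)
    if not sentences:
        return []
    model = SentenceTransformer(model_name)  # assumed embedding model
    embeddings = model.encode(sentences)     # one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos_sim < threshold:  # similarity drop -> likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks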

Agentic Chunking

Agentic chunking represents an advanced, emerging strategy where an LLM agent actively participates in the chunking process. The agent analyzes the document content, potentially considering the types of queries it might need to answer, to determine optimal chunk boundaries. This could involve the LLM generating summaries for sections and then chunking based on these summaries or identifying logical divisions based on its understanding of the text.

  • Pros: Potentially the most adaptive and contextually relevant chunking method, as it leverages the LLM's reasoning capabilities.
  • Cons: Computationally very expensive due to multiple LLM calls. Still an area of active research and development, with practical implementations being less common for large-scale deployments.

Example Paragraph Illustration:

User Manual: ACME Widget Model X. Section 1: Setup. To set up your ACME Widget X, first unbox all components. Then, connect the primary module to a stable supply of electricity. Refer to Figure 1.1 for component identification. Section 2: Operation. Press the main button to turn on the device. The indicator light should turn green. If it flashes red, consult Section 3: Troubleshooting. Section 3: Troubleshooting. Common issues include electricity supply problems or connectivity failures. For red flashing light, ensure electricity supply is stable. For connectivity, check cable C. Expected queries: How to set up Widget X? What does a red flashing light mean? My widget won't turn on.

An LLM agent analyzing this might create chunks aligned with user intent and document structure:

Chunk Agent's Reasoning Content
1 This block directly addresses setup queries. "User Manual: ACME Widget Model X. Section 1: Setup. To set up your ACME Widget X, first unbox all components. Then, connect the primary module to a stable supply of electricity. Refer to Figure 1.1 for component identification."
2 Explains basic operation and points to solutions for a common issue. "Section 2: Operation. Press the main button to turn on the device. The indicator light should turn green. If it flashes red, consult Section 3: Troubleshooting."
3 Provides direct answers to anticipated troubleshooting queries. "Section 3: Troubleshooting. Common issues include electricity supply problems or connectivity failures. For red flashing light, ensure electricity supply is stable. For connectivity, check cable C."

This approach is at the forefront of RAG optimization, aiming for a deeper understanding of content before segmentation.
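As a rough illustration only, the sketch below asks an LLM to propose chunk boundaries; the prompt wording, model name, and JSON contract are our own assumptions rather than any standard agentic-chunking API:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Split the following document into self-contained chunks, each of which "
    "should answer a likely user query on its own. "
    'Return JSON of the form {"chunks": ["..."]}.\n\n'
)

def agentic_chunker(document_text, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT + document_text}],
    )
    return json.loads(response.choices[0].message.content)["chunks"]

# manual_text = "User Manual: ACME Widget Model X. Section 1: Setup. ..."
# chunks = agentic_chunker(manual_text)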

Choosing the Right Chunking Strategy

Selecting the most suitable chunking strategy is not a one-size-fits-all decision. It depends on several factors related to your data, application, and available resources.

Factors to Consider

  • Nature of the Data:
      - Unstructured text (prose, articles): Recursive or sentence-based chunking works well; semantic chunking is better if the cost is acceptable.
      - Semi-structured text (Markdown, HTML): Use structural tags like headers and list items (e.g., MarkdownTextSplitter).
      - Structured data (tables, code): Use specialized chunkers. For code, split by function, class, or logical block (e.g., PythonCodeTextSplitter); for tables, chunk by rows or meaningful row/column groups.
  • Task Requirements:
      - Specific Q&A: Use smaller, focused chunks.
      - Summarization: Use larger chunks for thematic coverage.
      - Conversational AI: Ensure chunks support context and continuity.
  • LLM Context Window: The combined size of the retrieved chunks and the prompt must fit within the model's context window, which limits both chunk size and chunk count (see the budget sketch after this list).
  • Computational Resources & Latency: Fixed-size and sentence chunking are fast; semantic and agentic chunking are resource-intensive and add latency if run dynamically.
  • Desired Granularity: Use smaller chunks for specific questions, larger chunks for broader inquiries.
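To make the context-window constraint concrete, here is a small back-of-the-envelope budget check; every number below is hypothetical:

# Hypothetical budget check: how many retrieved chunks fit the context window?
CONTEXT_WINDOW = 8192    # model's context size in tokens (assumed)
PROMPT_TOKENS = 600      # system prompt + user query (assumed)
RESPONSE_RESERVE = 1024  # tokens reserved for the model's answer (assumed)
CHUNK_TOKENS = 512       # average tokens per chunk (assumed)

budget = CONTEXT_WINDOW - PROMPT_TOKENS - RESPONSE_RESERVE  # 6568 tokens
max_chunks = budget // CHUNK_TOKENS                          # 12 chunks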

Strategy Comparison

Strategy            Complexity  Semantic Coherence  Processing Speed  Context Control  Typical Use Case
Fixed-Size          Low         Low                 Fast              Low              Simple texts, initial experimentation
Sentence Splitting  Low-Medium  Medium              Fast              Medium           General prose, when sentences are key
Recursive           Medium      Medium-High         Moderate          Medium-High      Most unstructured text, good default
Content-Aware       High        High                Slow              High             Complex documents, high relevance needed
Agentic             Very High   Very High           Very Slow        Very High        Research, highly specialized tasks

Advanced Considerations & Best Practices

Beyond choosing a basic strategy, several other considerations can enhance the effectiveness of your chunking process.

Chunk Overlap

Chunk overlap involves repeating a small portion of text from the end of the preceding chunk at the beginning of the current chunk. This helps maintain context across chunk boundaries, ensuring that information isn't lost if a relevant piece of text falls near a split point.

  • Benefit: Reduces the chance of missing context that spans across two chunks.
  • Recommendation: A common overlap size is 10-20% of the chunk size. The ideal overlap can depend on the text density and type.

Chunk Metadata

Associating metadata with each chunk is highly valuable. This metadata can include information like the source document ID, page number, section titles, original filename, creation date, or even URLs.

  • Benefit: Improves retrieval by allowing filtering based on metadata (e.g., retrieve only chunks from a specific document or recent date). Enables citation of sources in the LLM's response. Helps in debugging and understanding the origin of retrieved context.
{
  "chunk_id": "doc1_chunk_003",
  "text": "The RAG system then retrieves these chunks...",
  "metadata": {
    "source_document": "rag_overview.pdf",
    "page_number": 4,
    "section": "2.1 Retrieval Process",
    "timestamp": "2023-10-26T10:30:00Z"
  }
}

Evaluating Chunking Strategies

Measuring the effectiveness of your chunking strategy is important. This often involves an iterative process:

  1. Define metrics. Retrieval metrics: use (query, relevant_document_ids) pairs and measure Hit Rate, Mean Reciprocal Rank (MRR), and NDCG over the retrieved chunks (a small sketch of the first two follows this list). End-to-end RAG performance: assess the final LLM-generated answers produced from the retrieved chunks; tools like RAGAs can help.
  2. Experiment. Try different chunking strategies, sizes, and overlaps.
  3. Analyze. Inspect the retrieved chunks for a sample of queries: Are they relevant? Do they contain complete information? Are they too noisy?
  4. Refine. Adjust chunking parameters based on the evaluation results.
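For instance, Hit Rate and MRR can be computed in a few lines; the (retrieved_ids, relevant_ids) input format below is an assumption for illustration:

def hit_rate_and_mrr(results):
    # results: list of (retrieved_ids, relevant_ids) pairs, one per query,
    # with retrieved_ids ordered by retrieval rank
    hits, rr_sum = 0, 0.0
    for retrieved, relevant in results:
        ranks = [i + 1 for i, doc_id in enumerate(retrieved) if doc_id in relevant]
        if ranks:
            hits += 1
            rr_sum += 1.0 / ranks[0]  # reciprocal rank of the first relevant hit
    n = len(results)
    return hits / n, rr_sum / n

# results = [(["c3", "c7", "c1"], {"c1"}), (["c2", "c9"], {"c4"})]
# hit_rate, mrr = hit_rate_and_mrr(results)  # -> (0.5, ~0.167)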

Preprocessing and Normalization

Before chunking, preprocess your text to improve quality:

  • Cleaning: Remove irrelevant characters, excessive whitespace, or artifacts (e.g., HTML tags if not using a structure-aware parser); a minimal sketch follows this list.
  • Normalization: Convert text to a consistent format (e.g., lowercase). While modern embedding models are robust to case, some normalization can still be beneficial. Be cautious with stemming or lemmatization, as they can sometimes alter meaning, though they might reduce vocabulary size for certain retrieval models.
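A small cleaning sketch along these lines (the regular expressions are illustrative, not a robust HTML parser):

import re

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)       # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse excessive whitespace
    return text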

Handling Different Document Structures

  • PDFs: These can be challenging due to complex layouts, headers/footers, and multi-column text. Tools that perform layout-aware PDF parsing (e.g., those that identify text blocks rather than just raw text streams) are preferable before chunking.
  • Markdown/HTML: Leverage the inherent structure. Split by headers, list items, or other semantic tags to create meaningful chunks. Libraries often have specific splitters for these formats (see the sketch after this list).
  • Code: Chunking by function, class, or even smaller logical blocks (e.g., important loops or conditional blocks with comments) can be effective. Specialized code splitters are available in libraries like LangChain.
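For example, LangChain's MarkdownHeaderTextSplitter splits on the header levels you declare; the header-to-metadata mapping below is just one possible configuration:

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [("#", "h1"), ("##", "h2")]  # header marker -> metadata key
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# markdown_doc = "# Title\nIntro...\n## Setup\nSteps..."
# sections = md_splitter.split_text(markdown_doc)  # one document per section,
#                                                  # with headers kept as metadata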

Tools and Libraries for Chunking

Several open-source libraries provide robust implementations of various chunking strategies, significantly simplifying development:

LangChain: Offers a comprehensive suite of TextSplitter classes, including CharacterTextSplitter, RecursiveCharacterTextSplitter, MarkdownTextSplitter, PythonCodeTextSplitter, and more. These are highly configurable.

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
)
# python_code = "... your Python source code ..."
# python_chunks = python_splitter.split_text(python_code)

LlamaIndex (formerly GPT Index): Provides extensive node parsing and text splitting capabilities, deeply integrated into its indexing and retrieval pipelines. It supports various strategies similar to LangChain and also offers semantic splitting options.

NLTK (Natural Language Toolkit) & spaCy: Excellent for fundamental NLP tasks like sentence tokenization, which forms the basis for sentence-based chunking. spaCy's sentence segmentation is generally very robust.

Haystack (by deepset): An end-to-end NLP framework that includes document preprocessing and chunking components as part of its indexing pipeline for question answering and semantic search.

Experimenting with these tools and their different configurations is often the best way to find an optimal setup for your specific RAG application.

Conclusion

Effective document chunking is a foundational element of high-performing Retrieval-Augmented Generation systems. The choice of chunking strategy, ranging from simple fixed-size splits to sophisticated content-aware methods, significantly impacts the relevance of retrieved context and the quality of LLM-generated responses.

There is no single "best" chunking strategy; the optimal approach depends on the data's nature, the specific task, computational constraints, and the LLM being used. By understanding the available techniques and their trade-offs, and by evaluating thoughtfully, you can fine-tune your chunking process to build more accurate, reliable, and efficient RAG applications.
