By Lea M. on May 20, 2025
Retrieval Augmented Generation (RAG) systems enhance Large Language Model (LLM) responses by providing relevant external knowledge. A fundamental step in building effective RAG systems is chunking, the process of dividing large documents into smaller, digestible pieces. The quality of these chunks directly influences the relevance of the retrieved context and, consequently, the accuracy and usefulness of the LLM's output.
Getting chunking right means feeding your LLM precisely the information it needs, without overwhelming it or missing important details. This balance is essential for optimizing both performance and cost in LLM applications.
Chunking involves breaking down extensive text or data sources into smaller segments, or "chunks." These chunks are then typically embedded and stored in a vector database for efficient similarity search. When a user poses a query, the RAG system retrieves the most relevant chunks to provide context for the LLM's generation process.
Without proper chunking, RAG systems face several challenges. LLMs have finite context windows; feeding overly large chunks can exceed these limits or introduce noise, diluting the important information. Conversely, chunks that are too small might lack sufficient context, leading to fragmented or incomplete answers. The goal is to create chunks that are semantically complete yet concise.
Several strategies exist for chunking documents, each with its own set of advantages and disadvantages. The choice of strategy often depends on the document structure, content type, and the specific requirements of the RAG application.
Fixed-size chunking is the most straightforward method. It involves splitting the text into segments of a predetermined length, typically measured in characters or tokens. An overlap between consecutive chunks is often introduced to maintain some contextual continuity.
Example Paragraph Illustration:
Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important.
If chunked with `chunk_size = 70` characters and `overlap_size = 10` (the boundaries below are approximate for readability):
Chunk | Content |
---|---|
1 | "Artificial intelligence is rapidly changing our daily routines. Machine " |
2 | "routines. Machine learning, a subset of AI, involves algorithms that l" (overlaps the end of chunk 1) |
3 | "orithms that learn from data. Deep learning, a further subset, uses n" (overlaps the end of chunk 2) |
4 | "r subset, uses neural networks with many layers. These technologies ar" (overlaps the end of chunk 3) |
5 | "echnologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important." (overlaps the end of chunk 4) |
This shows how sentences and ideas can be cut mid-way.
```python
# Fixed-size chunking by character count
def fixed_size_char_chunker(text, chunk_size, overlap_size):
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # Last chunk captured; stop
            break
        start += chunk_size - overlap_size
    return chunks

# text = "Your very long document text goes here..."
# chunks = fixed_size_char_chunker(text, 200, 20)
```
For token-based fixed-size chunking, libraries like `tiktoken` (from OpenAI) are commonly used to count tokens accurately for specific LLMs.
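As a minimal sketch, a token-based variant of the chunker above might look like the following; it assumes the `tiktoken` package is installed, and the `cl100k_base` encoding and the sizes in the usage comment are illustrative choices:

```python
# Token-based fixed-size chunking (sketch; assumes tiktoken is installed)
import tiktoken

def fixed_size_token_chunker(text, chunk_size, overlap_size, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)  # Encoding choice is illustrative
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(enc.decode(tokens[start:end]))  # Decode the token slice back to text
        if end >= len(tokens):
            break
        start += chunk_size - overlap_size
    return chunks

# chunks = fixed_size_token_chunker(text, chunk_size=256, overlap_size=32)
```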
Sentence-based chunking divides the text along sentence boundaries. Each chunk consists of one or more complete sentences. This approach generally preserves semantic integrity better than fixed-size chunking.
Example Paragraph Illustration:
Using the same text: "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important."
The sentences are:
1. "Artificial intelligence is rapidly changing our daily routines."
2. "Machine learning, a subset of AI, involves algorithms that learn from data."
3. "Deep learning, a further subset, uses neural networks with many layers."
4. "These technologies are applied in various fields, from healthcare to finance."
5. "Ethical considerations are also very important."
If chunked with `sentences_per_chunk = 2`, the resulting chunks are:
Chunk | Content |
---|---|
1 | "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data." |
2 | "Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance." |
3 | "Ethical considerations are also very important." |
Each chunk contains complete sentences, maintaining readability and some semantic context.
```python
import nltk
from nltk.tokenize import sent_tokenize

# Ensure the sentence tokenizer models are available
nltk.download('punkt', quiet=True)

def sentence_chunker(text, sentences_per_chunk=3):
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks

# text = "This is the first sentence. Here is another one. And a third."
# chunks = sentence_chunker(text, sentences_per_chunk=2)
```
Libraries like spaCy also offer robust sentence segmentation capabilities.
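For illustration, a brief sketch using spaCy's sentence segmentation; it assumes spaCy and the `en_core_web_sm` model are installed:

```python
# Sentence-based chunking with spaCy (sketch; assumes en_core_web_sm is installed)
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_sentence_chunker(text, sentences_per_chunk=3):
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```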
Recursive chunking attempts to divide text using a hierarchical list of separators. It starts with larger separators (e.g., double newlines for paragraphs) and recursively splits the text using smaller separators (e.g., single newlines for line breaks, then spaces) if the resulting chunks are still too large. The aim is to keep semantically related text segments together as much as possible.
Example Paragraph Illustration:
Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data. Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance. Ethical considerations are also very important.
Assume `chunk_size = 150` characters and `separators = ["\n\n", "\n", ". ", " ", ""]`. The text contains no `\n\n` or `\n`, so the splitter first splits on `". "`.
Sentences (and approximate lengths): sentence 1 is ~63 characters, sentence 2 ~75, sentence 3 ~71, sentence 4 ~77, and sentence 5 ~47.
The splitter attempts to create chunks near 150 chars, respecting separators:
Chunk | Content |
---|---|
1 | "Artificial intelligence is rapidly changing our daily routines. Machine learning, a subset of AI, involves algorithms that learn from data." |
2 | "Deep learning, a further subset, uses neural networks with many layers. These technologies are applied in various fields, from healthcare to finance." |
3 | "Ethical considerations are also very important." |
If a segment (e.g., a paragraph produced by splitting on `\n\n`) were still too large after the initial split, it would be recursively split by `". "`, then by `" "`, and so on.
Many modern NLP libraries, such as LangChain, provide implementations of recursive chunkers.
```python
# Using LangChain's RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Target size in characters
    chunk_overlap=50,  # Overlap in characters
    separators=["\n\n", "\n", ". ", " ", ""],  # Order matters: largest separators first
)

# long_document_text = "..."
# chunks = text_splitter.split_text(long_document_text)
```
This method is generally preferred for unstructured text due to its flexibility.
Content-aware, or semantic, chunking uses NLP techniques to identify natural breaks in meaning or topic within the text. Instead of relying on fixed sizes or syntactic delimiters, it analyzes the semantic content to form chunks. This can involve using embedding models to measure the similarity between adjacent sentences or paragraphs; a significant drop in similarity might indicate a good splitting point.
Example Paragraph Illustration:
The Apollo program achieved its goal of landing humans on the Moon. Key figures included Neil Armstrong and Buzz Aldrin. The Saturn V rocket was essential for these missions. Separately, developments in microbiology during the same era led to new antibiotics. Research into penicillin was particularly impactful.
Semantic chunking would analyze sentence similarity: the first three sentences (the Apollo program, Armstrong and Aldrin, the Saturn V) are closely related, while similarity drops sharply at "Separately, developments in microbiology...", signaling a topic shift. Resulting chunks based on this semantic grouping:
Chunk | Topic | Content |
---|---|---|
1 | Space Exploration | "The Apollo program achieved its goal of landing humans on the Moon. Key figures included Neil Armstrong and Buzz Aldrin. The Saturn V rocket was essential for these missions." |
2 | Microbiology | "Separately, developments in microbiology during the same era led to new antibiotics. Research into penicillin was particularly impactful." |
One approach involves calculating cosine similarity between embeddings of consecutive sentences. When the similarity score falls below a certain threshold, a new chunk begins.
(Figure: semantic chunking segments a document at shifts in topic or meaning, identified by similarity scores between adjacent text segments.)
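As a rough sketch of this threshold approach, the following assumes the `sentence-transformers` package and NLTK; the model name and the 0.3 threshold are illustrative assumptions, not recommendations:

```python
# Semantic chunking via embedding similarity (sketch; model and threshold are illustrative)
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Illustrative model choice

def semantic_chunker(text, threshold=0.3):
    sentences = sent_tokenize(text)
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = embeddings[i - 1], embeddings[i]
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < threshold:  # A similarity drop signals a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

In practice, the threshold is tuned on sample documents; too low and topics merge, too high and chunks fragment.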
Agentic chunking represents an advanced, emerging strategy where an LLM agent actively participates in the chunking process. The agent analyzes the document content, potentially considering the types of queries it might need to answer, to determine optimal chunk boundaries. This could involve the LLM generating summaries for sections and then chunking based on these summaries or identifying logical divisions based on its understanding of the text.
Example Paragraph Illustration:
User Manual: ACME Widget Model X. Section 1: Setup. To set up your ACME Widget X, first unbox all components. Then, connect the primary module to a stable supply of electricity. Refer to Figure 1.1 for component identification. Section 2: Operation. Press the main button to turn on the device. The indicator light should turn green. If it flashes red, consult Section 3: Troubleshooting. Section 3: Troubleshooting. Common issues include electricity supply problems or connectivity failures. For red flashing light, ensure electricity supply is stable. For connectivity, check cable C. Expected queries: How to set up Widget X? What does a red flashing light mean? My widget won't turn on.
An LLM agent analyzing this might create chunks aligned with user intent and document structure:
Chunk | Agent's Reasoning | Content |
---|---|---|
1 | This block directly addresses setup queries. | "User Manual: ACME Widget Model X. Section 1: Setup. To set up your ACME Widget X, first unbox all components. Then, connect the primary module to a stable supply of electricity. Refer to Figure 1.1 for component identification." |
2 | Explains basic operation and points to solutions for a common issue. | "Section 2: Operation. Press the main button to turn on the device. The indicator light should turn green. If it flashes red, consult Section 3: Troubleshooting." |
3 | Provides direct answers to anticipated troubleshooting queries. | "Section 3: Troubleshooting. Common issues include electricity supply problems or connectivity failures. For red flashing light, ensure electricity supply is stable. For connectivity, check cable C." |
This approach is at the forefront of RAG optimization, aiming for a deeper understanding of content before segmentation.
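There is no standard library routine for agentic chunking; one hedged sketch is simply to ask an LLM to propose the chunks directly. The example below assumes the `openai` package and an API key in the environment; the model name and prompt are illustrative assumptions:

```python
# Agentic chunking sketch: ask an LLM to propose chunk boundaries (details illustrative)
import json
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def agentic_chunker(document, model="gpt-4o-mini"):  # Model choice is an assumption
    prompt = (
        "Split the following document into self-contained chunks that each answer "
        "a likely user query. Return a JSON array of strings, one per chunk.\n\n"
        + document
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)  # Assumes valid JSON output
```

A production version would validate the output and fall back to a simpler splitter if parsing fails, since LLM output is not guaranteed to be well-formed.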
Selecting the most suitable chunking strategy is not a one-size-fits-all decision. It depends on several factors related to your data, application, and available resources.
Factor | Details |
---|---|
Nature of the Data | Unstructured text (prose, articles): recursive or sentence-based chunking works well; semantic chunking is better if the cost is acceptable. Semi-structured text (Markdown, HTML): split on structural tags such as headers and list items, e.g., with `MarkdownTextSplitter` (see the sketch after this table). Structured data (tables, code): use specialized chunkers. For code, split by function/class/logical block (e.g., `PythonCodeTextSplitter`); for tables, chunk by rows or meaningful row/column groups. |
Task Requirements | Specific Q&A: Use smaller, focused chunks. Summarization: Use larger chunks for thematic coverage. Conversational AI: Ensure chunks support context and continuity. |
LLM Context Window | Total chunk size + prompt must fit within the model's context window, limiting chunk size and quantity. |
Computational Resources & Latency | - Fixed-size or sentence chunking is fast. - Semantic/agentic chunking is resource-intensive and increases latency if dynamic. |
Desired Granularity | - Use smaller chunks for specific questions. - Use larger chunks for broader inquiries. |
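For the semi-structured case above, a brief sketch using LangChain's `MarkdownTextSplitter`, which applies Markdown-aware separators (the sizes and sample document are illustrative):

```python
# Splitting Markdown along its structural separators (sizes are illustrative)
from langchain_text_splitters import MarkdownTextSplitter

md_splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)

# markdown_doc = "# Title\n\nSome intro...\n\n## Section\n\nDetails..."
# md_chunks = md_splitter.split_text(markdown_doc)
```

The table below summarizes the trade-offs across strategies: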
Strategy | Complexity | Semantic Coherence | Processing Speed | Context Control | Typical Use Case |
---|---|---|---|---|---|
Fixed-Size | Low | Low | Fast | Low | Simple texts, initial experimentation |
Sentence Splitting | Low-Medium | Medium | Fast | Medium | General prose, when sentences are key |
Recursive | Medium | Medium-High | Moderate | Medium-High | Most unstructured text, good default |
Content-Aware | High | High | Slow | High | Complex documents, high relevance needed |
Agentic | Very High | Very High | Very Slow | Very High | Research, highly specialized tasks |
Beyond choosing a basic strategy, several other considerations can enhance the effectiveness of your chunking process.
Chunk overlap involves including a small portion of text from the end of the preceding chunk at the beginning of the current chunk, and vice versa. This helps maintain context across chunk boundaries, ensuring that information isn't lost when a relevant piece of text falls near a split point.
Associating metadata with each chunk is highly valuable. This metadata can include information like the source document ID, page number, section titles, original filename, creation date, or even URLs.
```json
{
  "chunk_id": "doc1_chunk_003",
  "text": "The RAG system then retrieves these chunks...",
  "metadata": {
    "source_document": "rag_overview.pdf",
    "page_number": 4,
    "section": "2.1 Retrieval Process",
    "timestamp": "2023-10-26T10:30:00Z"
  }
}
```
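Such metadata can also drive filtered retrieval at query time. A minimal sketch, assuming the `chromadb` package (the collection name and filter values are illustrative):

```python
# Storing chunks with metadata and filtering at query time (sketch; assumes chromadb)
import chromadb

client = chromadb.Client()
collection = client.create_collection("rag_chunks")  # Name is illustrative

collection.add(
    ids=["doc1_chunk_003"],
    documents=["The RAG system then retrieves these chunks..."],
    metadatas=[{"source_document": "rag_overview.pdf", "page_number": 4}],
)

results = collection.query(
    query_texts=["How does retrieval work?"],
    n_results=3,
    where={"source_document": "rag_overview.pdf"},  # Metadata filter
)
```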
Measuring the effectiveness of your chunking strategy is important. This often involves an iterative process:
Step | Description |
---|---|
Define Metrics | Retrieval metrics: use (query, relevant_document_ids) pairs; measure Hit Rate, Mean Reciprocal Rank (MRR), and NDCG over the retrieved chunks (see the sketch after this table). End-to-end RAG performance: assess the final LLM-generated answers based on the retrieved chunks; tools like RAGAs can help. |
Experiment | Try different chunking strategies, sizes, and overlaps. |
Analyze | Inspect retrieved chunks for a sample of queries: - Are they relevant? - Do they contain complete information? - Are they too noisy? |
Refine | Adjust chunking parameters based on evaluation results. |
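As a small sketch of the retrieval metrics mentioned in the table, Hit Rate and MRR can be computed over labeled query results; the data shapes here are assumptions:

```python
# Hit Rate and Mean Reciprocal Rank over labeled retrieval results (sketch)
def evaluate_retrieval(results):
    # results: list of (retrieved_chunk_ids, relevant_chunk_ids) pairs, one per query
    hits, reciprocal_ranks = 0, []
    for retrieved, relevant in results:
        # Rank (1-based) of the first relevant chunk, if any was retrieved
        rank = next((i + 1 for i, cid in enumerate(retrieved) if cid in relevant), None)
        if rank is not None:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    n = len(results)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}

# metrics = evaluate_retrieval([(["c3", "c7"], {"c7"}), (["c1"], {"c9"})])
# -> hit_rate = 0.5, MRR = (1/2 + 0) / 2 = 0.25
```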
Before chunking, preprocess your text to improve quality:
* Remove boilerplate such as repeated headers, footers, and navigation text.
* Strip leftover HTML tags and fix encoding artifacts.
* Normalize whitespace and line breaks so separators behave predictably.
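A hedged sketch of this kind of cleanup, shown below; the regex patterns are illustrative, not exhaustive:

```python
# Basic text cleanup before chunking (patterns are illustrative)
import re

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)          # Strip leftover HTML tags
    text = re.sub(r"Page \d+ of \d+", " ", text)  # Drop repeated page footers
    text = re.sub(r"\s+", " ", text)              # Normalize whitespace
    return text.strip()
```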
Several open-source libraries provide robust implementations of various chunking strategies, significantly simplifying development:
* LangChain: Offers a comprehensive suite of `TextSplitter` classes, including `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`, `MarkdownTextSplitter`, `PythonCodeTextSplitter`, and more. These are highly configurable.
```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Use Python-aware separators (functions, classes, logical blocks)
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
# python_code = "... your Python source code ..."
# python_chunks = python_splitter.split_text(python_code)
```
* LlamaIndex (formerly GPT Index): Provides extensive node parsing and text splitting capabilities, deeply integrated into its indexing and retrieval pipelines. It supports various strategies similar to LangChain and also offers semantic splitting options.
* NLTK (Natural Language Toolkit) & spaCy: Excellent for fundamental NLP tasks like sentence tokenization, which forms the basis for sentence-based chunking. SpaCy's sentence segmentation is generally very robust.
* Haystack (by deepset): An end-to-end NLP framework that includes document preprocessing and chunking components as part of its indexing pipeline for question answering and semantic search.
Experimenting with these tools and their different configurations is often the best way to find an optimal setup for your specific RAG application.
Effective document chunking is a foundational element of high-performing Retrieval Augmented Generation systems. The choice of chunking strategy, ranging from simple fixed-size splits to sophisticated content-aware methods, significantly impacts the relevance of retrieved context and the quality of LLM-generated responses.
There is no single "best" chunking strategy; the optimal approach depends on the data's nature, the specific task, computational constraints, and the LLM being used. By understanding the available techniques and their trade-offs, and by employing thoughtful evaluation, you can fine-tune your chunking process to build more accurate, reliable, and efficient RAG applications.