When building Retrieval-Augmented Generation (RAG) systems, preparing retrieved information for the LLM is a critical process. This involves more than just concatenating text; it encompasses managing token limits, formatting the content clearly, and structuring the information to guide the model toward the best possible response. The effective preparation of this information is a significant factor in the quality of a RAG system's output.
Large language models operate with a fixed context window, a limit on the number of tokens they can process at once. The combined size of your system prompt, user query, and retrieved documents must not exceed this limit. Even after splitting documents into smaller chunks, the retrieved context can easily add up to thousands of tokens, making it necessary to reduce its size before sending it to the model.
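Before compressing anything, it helps to estimate how many tokens your prompt components actually consume. The snippet below is a minimal sketch using the tiktoken package (a separate library, not part of kerb); the encoding name and the 8,000-token window are illustrative assumptions, so substitute the values that match your model.

import tiktoken

# Illustrative numbers only: adjust the encoding and context window to your model.
CONTEXT_WINDOW = 8000
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens for a piece of text with the chosen encoding."""
    return len(enc.encode(text))

system_prompt = "Answer the question using only the provided context."
user_query = "How does async programming work in Python?"
retrieved_chunks = ["...chunk 1 text...", "...chunk 2 text..."]  # placeholder chunks

total = (
    count_tokens(system_prompt)
    + count_tokens(user_query)
    + sum(count_tokens(chunk) for chunk in retrieved_chunks)
)
print(f"Prompt uses ~{total} of {CONTEXT_WINDOW} tokens")

If the total approaches the window, you need to reduce the retrieved portion before building the final prompt, which is exactly what the compression utilities below are for.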
The retrieval module provides the compress_context function to help you shrink the retrieved results to fit within a specified token budget.
from kerb.retrieval import compress_context

# Assume 'results' is a list of SearchResult objects from a search function
# and 'query' is the user's search query.
# Compress the results to fit within approximately 2000 tokens
compressed_results = compress_context(
    query=query,
    results=results,
    max_tokens=2000
)
The compress_context function offers several compression methods via the strategy parameter, giving you control over the trade-off between speed and information retention:
"top_k" (default): This is the simplest strategy. It keeps the highest-ranked search results that fit within the max_tokens limit and discards the rest. It is fast and effective when your initial ranking is reliable."summarize": This strategy uses an LLM to generate a concise summary of each document chunk that is too long. It is more computationally intensive but can retain important information from longer documents."filter": This method removes sentences or parts of documents that are less relevant to the original query, effectively filtering out noise while preserving the most pertinent information."truncate": A direct approach that simply cuts off each document chunk after a certain number of characters or tokens.For most applications, starting with the "top_k" strategy is a good balance of performance and simplicity.
Once you have a list of compressed or filtered SearchResult objects, you need to convert them into a single string that the LLM can process. This string serves as the "context" portion of your final prompt, and its structure can heavily influence the quality of the generated answer.
The results_to_context function is designed for this task. It takes a list of search results and formats them into a clean, readable string suitable for an LLM prompt.
from kerb.retrieval import results_to_context

# 'compressed_results' is the output from the previous step
context_string = results_to_context(
    results=compressed_results,
    separator='\n\n---\n\n',
    include_source=True
)

# Now, build the final prompt for the LLM
query = "How does async programming work in Python?"
prompt = f"""Answer the question based on the provided context.

Context:
{context_string}

Question: {query}

Answer:"""

print(prompt[:500] + "...")
The function provides two parameters for customization:
- separator: A string used to separate the content of each document. A clear separator helps the model distinguish between different sources.
- include_source: When True, it includes the document's ID (from doc.id) with its content. This is valuable for source attribution, as it allows the LLM to cite which document provided specific information in its response; see the citation-style prompt sketch below.
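As an illustration of that attribution pattern, the snippet below builds a prompt that asks the model to cite the document IDs that include_source=True adds to the formatted context. The instruction wording and the bracketed ID format are assumptions for this sketch, not behavior guaranteed by the library.

# Sketch: instruct the model to cite the document IDs present in the context.
# 'context_string' and 'query' come from the previous example.
citation_prompt = f"""Answer the question using only the context below.
When you state a fact, cite the ID of the document it came from in brackets,
for example: [doc_042].

Context:
{context_string}

Question: {query}

Answer:"""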
In a production RAG system, context management is a multi-step process. You typically filter and re-order documents before compressing and formatting them to ensure the highest-quality information is sent to the LLM.
The RAGPipeline class in the reference examples demonstrates an effective workflow within its get_context method. The sequence of operations is as follows:
1. Retrieve an initial, broader set of candidate results (for example, top_k=12).
2. Filter the retrieved results with filter_results. This could involve setting a minimum relevance score or deduplicating similar chunks.
3. Apply diversify_results to ensure the context covers different aspects of the query and reduce redundancy.
4. Use compress_context to shrink the filtered results to fit the final token budget.
5. Call results_to_context to create the final string for the LLM prompt.

Here is how that sequence looks in practice, adapted from the 05_rag_pipeline.py example:
from kerb.retrieval import (
    filter_results,
    diversify_results,
    compress_context,
    results_to_context
)

# Assume 'results' is a list of 12 SearchResult objects
query = "What are the differences between REST and GraphQL APIs?"

# 1. Filter by quality
filtered = filter_results(
    results,
    min_score=0.2,
    dedup_threshold=0.9
)

# 2. Apply diversity to reduce redundancy
diversified = diversify_results(filtered, max_results=8, diversity_factor=0.4)

# 3. Compress to fit the context window
compressed = compress_context(query, diversified, max_tokens=2000)

# 4. Format for the LLM prompt
final_context = results_to_context(compressed)

print(final_context[:500] + "...")
This structured approach ensures that the context provided to the LLM is not just a random collection of retrieved texts, but a carefully curated set of high-quality, relevant, and appropriately sized information. Proper management of the retrieved context is just as important as the retrieval itself for achieving accurate and helpful RAG responses.