Building a functional retrieval pipeline involves assembling the document processing and embedding components into a single flow. This flow forms the "Retrieval" core of Retrieval-Augmented Generation. The goal is to take a user's query, search your prepared documents, and return the most relevant information to serve as context for a Large Language Model.
The retrieval pipeline consists of a few distinct stages that work in sequence: searching, ranking, and formatting. For our first implementation, we will build a straightforward version that takes a user query, finds relevant documents with hybrid search, and formats the results into a context string ready to be injected into an LLM prompt.
Building the pipeline means combining the modules for document representation, embedding, and searching. For this example, we will work with an in-memory collection of documents and their pre-computed embeddings. In a production system, these would typically be stored and managed in a specialized vector database.
Let's start by defining our knowledge base and preparing it for search. This involves creating Document objects and generating their corresponding embeddings in a batch.
from kerb.retrieval import Document
from kerb.embedding import embed, embed_batch
# 1. Define the knowledge base
documents = [
    Document(
        id="doc1",
        content="Python is a high-level programming language known for its simplicity and readability."
    ),
    Document(
        id="doc2",
        content="Asynchronous programming in Python allows concurrent execution using async/await syntax."
    ),
    Document(
        id="doc3",
        content="Machine learning models learn patterns from data to make predictions."
    ),
    Document(
        id="doc4",
        content="REST APIs provide a standardized way for applications to communicate over HTTP."
    ),
]
# 2. Generate embeddings for the documents
print("Generating document embeddings...")
doc_texts = [doc.content for doc in documents]
doc_embeddings = embed_batch(doc_texts)
print(f"Generated {len(doc_embeddings)} embeddings.")
With our documents and embeddings ready, the next step is to perform a search. While you can use simple keyword_search or semantic_search, a more effective approach is hybrid_search. This function combines the strengths of both keyword matching (for specific terms) and semantic similarity (for meaning), providing more reliable results across different types of queries.
To use hybrid_search, you need the user's query text, its embedding, the list of documents, and their corresponding embeddings.
from kerb.retrieval import hybrid_search
# 3. Define a user query and generate its embedding
query = "building scalable python applications"
query_embedding = embed(query)
# 4. Perform the search
search_results = hybrid_search(
    query=query,
    query_embedding=query_embedding,
    documents=documents,
    document_embeddings=doc_embeddings,
    top_k=2,
    keyword_weight=0.4,
    semantic_weight=0.6
)
print(f"\nFound {len(search_results)} relevant documents for the query: '{query}'")
for result in search_results:
    print(f" - ID: {result.document.id}, Score: {result.score:.3f}")
    print(f"   Content: {result.document.content[:80]}...")
The output of hybrid_search is a list of SearchResult objects, each containing a Document and a relevance score. While this is useful for inspection, it is not in the right format to be passed to an LLM. The model expects a single string of text as its context.
The results_to_context function handles this conversion for you. It takes the list of SearchResult objects and formats their content into a clean, well-structured string, with separators between each document.
from kerb.retrieval import results_to_context
# 5. Format the search results into a context string
context_string = results_to_context(search_results)
print("\nFormatted context for LLM:")
print("---------------------------")
print(context_string)
print("---------------------------")
This formatted context_string is the final output of our simple retrieval pipeline. It contains the most relevant information from your knowledge base, ready to be combined with the user's query to form a complete prompt. The LLM will use this context to generate a grounded, accurate, and informative response.
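If your prompt template needs a different layout, you can also assemble the context manually from the SearchResult objects. Here is a minimal sketch; the separator and source labels are arbitrary choices for illustration, not the results_to_context defaults.
# Manually build a context string from the search results (custom layout).
context_parts = []
for result in search_results:
    context_parts.append(f"[Source: {result.document.id}]\n{result.document.content}")
custom_context = "\n\n---\n\n".join(context_parts)
print(custom_context)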
Let's consolidate these steps into a single, cohesive example that simulates a complete RAG flow from query to final prompt construction. This demonstrates how the different modules work together to retrieve and prepare information for generation.
from kerb.retrieval import Document, hybrid_search, results_to_context
from kerb.embedding import embed, embed_batch
# --- 1. Setup Phase ---
# In a real application, this is done once to build your index.
documents = [
    Document(id="py-async", content="Asynchronous programming in Python uses async/await syntax for I/O-bound tasks."),
    Document(id="py-web", content="FastAPI and Flask are popular Python frameworks for building web APIs."),
    Document(id="ml-intro", content="Machine learning enables systems to learn from data without explicit programming."),
]
# Generate and store embeddings for the knowledge base
doc_embeddings = embed_batch([doc.content for doc in documents])
# --- 2. Retrieval Phase ---
# This is performed for each user query.
user_query = "How do I build fast web services in Python?"
# Embed the user query
query_embedding = embed(user_query)
# Perform hybrid search to find relevant documents
retrieved_results = hybrid_search(
    query=user_query,
    query_embedding=query_embedding,
    documents=documents,
    document_embeddings=doc_embeddings,
    top_k=2,
    keyword_weight=0.5,
    semantic_weight=0.5
)
# Format the results into a single context string
context = results_to_context(retrieved_results)
# --- 3. Generation Phase ---
# Construct the final prompt for the LLM.
final_prompt = f"""Answer the following question based on the provided context.
Context:
{context}
Question: {user_query}
Answer:"""
print("\n--- Final Prompt for LLM ---")
print(final_prompt)
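As you start reusing the pipeline, it helps to wrap the per-query steps in a small function so every query runs against the same knowledge base. The helper below is a sketch built only from the functions already shown; retrieve_context is a name chosen for this example, not part of the library.
def retrieve_context(user_query, documents, doc_embeddings, top_k=2):
    """Embed the query, run hybrid search, and format the hits into a context string."""
    query_embedding = embed(user_query)
    results = hybrid_search(
        query=user_query,
        query_embedding=query_embedding,
        documents=documents,
        document_embeddings=doc_embeddings,
        top_k=top_k,
        keyword_weight=0.5,
        semantic_weight=0.5,
    )
    return results_to_context(results)

# Each new user query now needs only a single call:
context = retrieve_context("How do I handle I/O-bound work in Python?", documents, doc_embeddings)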
This simple pipeline serves as the foundation for any RAG system. By assembling these components, you have successfully built a system that can ground LLM responses in your own data. The following sections will explore how to enhance this pipeline with more advanced techniques, such as re-ranking to improve relevance and context management to optimize for token limits.