A Retrieval-Augmented Generation system extends a standard Large Language Model by connecting it to an external knowledge base. Instead of relying solely on its training data, the model can pull in relevant, up-to-date information to answer questions. This process isn't monolithic; it's a multi-stage pipeline composed of two primary phases: an offline Indexing Pipeline for data preparation and an online Retrieval & Generation Pipeline for answering user queries.
Understanding this two-part architecture is fundamental to building and troubleshooting RAG systems. The indexing phase prepares your knowledge, while the retrieval and generation phase uses that knowledge to create informed responses.
Figure: A high-level view of the two main phases in a RAG system. The indexing pipeline prepares the knowledge base, and the retrieval pipeline uses it to answer queries at inference time.
Before you can answer any questions, you need to build a searchable knowledge base from your source documents. This is typically done offline and involves a series of data processing steps.
The first step is to load your data. This could be a collection of text files, PDFs, Markdown documents, or content scraped from a website. In previous chapters, you saw how the document module can handle various sources and convert them into a standardized format. The goal is to get your raw text into memory so it can be processed.
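As a minimal sketch, loading a folder of plain-text files could look like the following; the knowledge_base/ directory is a hypothetical example, and the document module from earlier chapters is what you would actually use for richer formats such as PDFs, Markdown, or scraped web pages.

# A minimal loading sketch: read every .txt file in a (hypothetical)
# knowledge_base/ directory into memory. The document module covered
# earlier handles PDFs, Markdown, and scraped web content.
from pathlib import Path

raw_documents = [
    path.read_text(encoding="utf-8")
    for path in sorted(Path("knowledge_base").glob("*.txt"))
]

print(f"Loaded {len(raw_documents)} documents")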
Large documents often exceed an LLM's context window. Furthermore, searching over entire documents is inefficient and can return noisy, irrelevant information. To solve this, you must split the documents into smaller, more manageable pieces, or "chunks."
Effective chunking is critical for RAG performance. Chunks should be small enough to contain specific, focused information but large enough to retain meaningful context. Strategies like recursive character splitting or sentence-based windowing, which you explored in Chapter 4, are designed to create semantically coherent chunks.
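To make the idea concrete, here is a simplified sketch that splits text into overlapping character windows. It is not the recursive splitter from Chapter 4, only an illustration of the chunking step; the chunk size and overlap values are arbitrary examples.

# A simplified chunker: fixed-size character windows with overlap.
# Chapter 4's recursive and sentence-based splitters produce more
# coherent chunks; this only illustrates the general idea.
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# In the full pipeline these strings would be wrapped in the document
# objects from the document module so each chunk keeps its metadata.
chunk_texts = [chunk for doc in raw_documents for chunk in chunk_text(doc)]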
Once you have your document chunks, you need a way to search them based on meaning, not just keywords. This is where embeddings come in. Using an embedding model, each chunk of text is converted into a numerical vector that represents its semantic content. These vectors are the foundation of semantic search. The embed_batch function is ideal for efficiently creating embeddings for all your chunks at once.
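Embedding the chunks might then look like the sketch below. It assumes embed_batch accepts a list of strings and returns one vector per string in the same order; check the actual signature in the embedding module you are using.

# Embed every chunk in one batched call. Assumption: embed_batch takes a
# list of strings and returns a list of vectors in matching order.
all_chunk_embeddings = embed_batch(chunk_texts)

print(f"Created {len(all_chunk_embeddings)} embeddings")
print(f"Each embedding has {len(all_chunk_embeddings[0])} dimensions")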
With embeddings for every chunk, you need a place to store them for fast and scalable retrieval. A vector index (or vector database) is a specialized data structure designed for this purpose. It indexes the high-dimensional vectors so that you can quickly find the vectors (and their corresponding text chunks) that are most similar to a given query vector. While this course uses a simple in-memory search for demonstration, production systems typically use dedicated vector databases like Pinecone, Weaviate, or Chroma.
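As a minimal sketch of what an index does under the hood, the function below scores every stored vector against a query vector with cosine similarity and returns the indices of the closest matches. Production vector databases replace this brute-force scan with approximate nearest-neighbor structures to stay fast at scale.

import numpy as np

def cosine_top_k(query_vector, stored_vectors, k=3):
    """Brute-force cosine similarity search over an in-memory list of vectors."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.asarray(stored_vectors, dtype=float)
    # Cosine similarity: dot products of L2-normalized vectors.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Indices of the k highest-scoring chunks, best first.
    return np.argsort(scores)[::-1][:k]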
This is the online phase that executes whenever a user submits a query. It uses the pre-built index to find relevant information and generate an answer.
The process begins with a user's question, for example, "How do I implement authentication in a Python FastAPI app?". Just like the document chunks, this query is converted into an embedding vector using the same embedding model. This ensures that the query and the documents are represented in the same vector space, making them comparable.
The query vector is then used to search the vector index. The index performs a similarity search (often using cosine similarity) to find the document chunk vectors that are closest to the query vector. These chunks are considered the most semantically relevant to the user's question.
The retrieval module provides functions like semantic_search to perform this operation. You retrieve the top-k most relevant results, where k is a configurable number (typically between 3 and 5).
# In a real application, documents and embeddings would be pre-computed
# and stored in a vector index.
# For demonstration, we pass them directly.

# Embed the user's query
query_embedding = embed("How do I implement authentication in Python?")

# Find the top 3 most relevant document chunks
retrieved_results = semantic_search(
    query_embedding=query_embedding,
    documents=all_document_chunks,
    document_embeddings=all_chunk_embeddings,
    top_k=3
)

# 'retrieved_results' now contains the most relevant chunks of text
for result in retrieved_results:
    print(f"Retrieved Document (Score: {result.score:.2f}):")
    print(f"{result.document.content[:150]}...")
This is the "Augmented" part of Retrieval-Augmented Generation. Instead of sending the user's query directly to the LLM, you construct a new, more detailed prompt. This prompt includes the retrieved document chunks as context, along with the original query.
This gives the LLM the specific information it needs to formulate an accurate answer. The results_to_context function helps format the retrieved documents into a clean string to be inserted into a prompt template.
# Format the retrieved search results into a single string
context_string = results_to_context(retrieved_results)
# Create an augmented prompt for the LLM
augmented_prompt = f"""Answer the question based on the following context.
Context:
{context_string}
Question: How do I implement authentication in Python?
Answer:"""
Finally, the augmented prompt is sent to a large language model. The model uses the provided context to generate a response that is grounded in the information from your documents. Because the LLM has the relevant facts right in its prompt, it is far less likely to "hallucinate" or provide incorrect information. The resulting answer is accurate, relevant, and based on your specific knowledge base.
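As one possible sketch of this final step, the augmented prompt can be sent to any chat-completion client or to the generation helpers from earlier chapters. The OpenAI-compatible client and model name below are assumptions, not part of the pipeline itself.

# A minimal generation sketch. The client and model name are assumptions;
# any chat-completion client or the course's own generation helper works,
# as long as it receives the augmented prompt.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(response.choices[0].message.content)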
This completes the RAG anatomy. By separating the data preparation from the query processing, you create an efficient system that can answer questions over large amounts of private or up-to-date information. Now, let's move on to building your first end-to-end retrieval pipeline.