This section walks through building and running a complete, basic Retrieval-Augmented Generation (RAG) system. The practical exercise demonstrates how the retriever and generator work in concert to answer queries using external knowledge. The steps are simulated using common patterns and libraries, focusing on the flow rather than exhaustive implementation details.

Assume you have your Python environment set up with the necessary libraries installed, as covered in the "Setting up the Environment" section. This typically includes a library for vector embeddings (like `sentence-transformers`), a vector database client (like `chromadb` or `faiss-cpu`), and an LLM interface (like `openai` or `huggingface_hub`). For simplicity, we might illustrate ideas using pseudocode or high-level framework calls (like those found in LangChain or LlamaIndex).

## Preparing Sample Data

First, we need a small knowledge base. Let's imagine we have a few text snippets about different planets in our solar system. In a real application, you would load these from files (as discussed in Chapter 3), but here we'll define them directly:

```python
# Sample documents representing our knowledge base
documents = [
    "Mercury is the smallest planet in our Solar System and closest to the Sun.",
    "Venus has a thick, toxic atmosphere filled with carbon dioxide and is perpetually shrouded in thick, yellowish clouds of sulfuric acid.",
    "Earth is the third planet from the Sun and the only astronomical object known to harbor life.",
    "Mars is often called the 'Red Planet' because of its reddish appearance, due to iron oxide prevalent on its surface."
]

# In a real scenario, we'd apply chunking strategies here.
# For simplicity, we'll treat each sentence as a 'chunk'.
# Assume metadata like {'source': 'solar_system_facts.txt', 'doc_id': i} is associated with each.
document_chunks = documents
```

Here, we're keeping it simple by treating each sentence as a document chunk. Remember from Chapter 3 that effective chunking is important for larger documents.

## Initializing Core Components

Next, we initialize the main building blocks: the embedding model, the vector store, and the LLM.

```python
# 1. Embedding Model
# Load a pre-trained model to generate embeddings
# (e.g., from the sentence-transformers library)
# embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Vector Store (In-Memory ChromaDB example)
# import chromadb
# client = chromadb.Client()  # In-memory client
# vector_store = client.create_collection("solar_system")

# 3. Large Language Model
# Initialize connection to an LLM
# (e.g., using OpenAI's library or Hugging Face's transformers)
# llm = OpenAI(api_key="YOUR_API_KEY")  # Or load a local model
```

These steps set up the tools needed: one to convert text to vectors, one to store and search those vectors efficiently, and one to generate text based on input prompts.
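If you want to instantiate the first two components rather than read them as pseudocode, a minimal sketch might look like the following. It assumes the `sentence-transformers` and `chromadb` packages from the environment setup are installed; the LLM client is left out here because it depends on your provider and credentials.

```python
from sentence_transformers import SentenceTransformer
import chromadb

# Embedding model: maps text to dense vectors (384 dimensions for this model)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# In-memory ChromaDB client and a collection to hold our chunks
client = chromadb.Client()
vector_store = client.create_collection(name="solar_system")
```

The in-memory client is convenient for experimentation; ChromaDB also offers a persistent client if you want the index to survive restarts.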
## Indexing the Data

Now, we process our document chunks and store them in the vector store. This involves generating an embedding for each chunk and adding the chunk text, its embedding, and any associated metadata to the vector database.

```python
# (Continuing Example)

# Generate embeddings for each chunk
# chunk_embeddings = embedding_model.encode(document_chunks)

# Prepare data for indexing (text, embeddings, metadata)
# ids = [f"chunk_{i}" for i in range(len(document_chunks))]
# metadatas = [{'source': 'solar_system_facts.txt', 'doc_id': i} for i in range(len(document_chunks))]

# Add to the vector store
# vector_store.add(
#     embeddings=chunk_embeddings,
#     documents=document_chunks,
#     metadatas=metadatas,
#     ids=ids
# )

print("Data successfully indexed in the vector store.")
```

After this step, our knowledge base is ready to be queried. The vector store can now perform similarity searches to find chunks relevant to a user's question.
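As a concrete counterpart to the commented code above, here is a hedged sketch of the same indexing step. It assumes the `embedding_model` and `vector_store` objects from the earlier initialization sketch exist in scope, and it finishes with a quick similarity query to confirm the index behaves as expected.

```python
# Embed each chunk and store text, embedding, and metadata together
chunk_embeddings = embedding_model.encode(document_chunks).tolist()
ids = [f"chunk_{i}" for i in range(len(document_chunks))]
metadatas = [{"source": "solar_system_facts.txt", "doc_id": i} for i in range(len(document_chunks))]

vector_store.add(
    ids=ids,
    embeddings=chunk_embeddings,
    documents=document_chunks,
    metadatas=metadatas,
)

# Sanity check: the Mercury chunk should be the nearest neighbor for this query
probe_embedding = embedding_model.encode(["Which planet is closest to the sun?"]).tolist()
print(vector_store.query(query_embeddings=probe_embedding, n_results=1)["documents"][0])
```

If the printed chunk is the Mercury sentence, the index is ready for the retrieval step defined in the next section.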
## Defining the RAG Process

We now define the core logic that takes a user query, retrieves relevant context, and generates an answer using the LLM. This can be encapsulated in a function or, more commonly when using frameworks, defined as a chain or sequence.

```python
# (Example)
def answer_query_with_rag(query: str):
    # 1. Embed the query
    # query_embedding = embedding_model.encode([query])[0]

    # 2. Retrieve relevant chunks
    # Search the vector store for the top_k most similar chunks
    # retrieved_results = vector_store.query(
    #     query_embeddings=[query_embedding],
    #     n_results=2  # Retrieve top 2 relevant chunks
    # )
    # retrieved_docs = retrieved_results['documents'][0]
    # print(f"Retrieved Documents: {retrieved_docs}")  # Optional: inspect retrieved context

    # 3. Construct the prompt
    # Combine the original query with the retrieved context
    # context_string = "\n\n".join(retrieved_docs)
    # prompt_template = f"""
    # Based on the following context, answer the query.
    # If the context doesn't contain the answer, state that.
    #
    # Context:
    # {context_string}
    #
    # Query: {query}
    #
    # Answer:
    # """
    # print(f"Generated Prompt:\n{prompt_template}")  # Optional: inspect the prompt

    # 4. Generate the response using the LLM
    # response = llm.complete(prompt=prompt_template)  # Or equivalent LLM API call
    # generated_answer = response  # Process the response object as needed

    # For demonstration, we'll return placeholder values
    generated_answer = "This is a placeholder answer based on retrieved context."
    retrieved_docs = ["Document 1 text...", "Document 2 text..."]

    return generated_answer, retrieved_docs
```

This function outlines the RAG flow: embed the query, search the vector store, use the results to build a detailed prompt, and finally, call the LLM with this augmented prompt.

## Running and Testing the Pipeline

Let's test our basic RAG system with a query related to our indexed data.

```python
# Example Query
user_query = "Which planet is closest to the sun?"

# Execute the RAG process
final_answer, retrieved_context = answer_query_with_rag(user_query)

print("\n" + "="*30)
print(f"User Query: {user_query}")
# In a real run, retrieved_context would contain the actual document text
print(f"Retrieved Context: {retrieved_context}")
print(f"Generated Answer: {final_answer}")
print("="*30 + "\n")

# Another Example Query
user_query_2 = "What is Mars known for?"
final_answer_2, retrieved_context_2 = answer_query_with_rag(user_query_2)

print("="*30)
print(f"User Query: {user_query_2}")
print(f"Retrieved Context: {retrieved_context_2}")
print(f"Generated Answer: {final_answer_2}")
print("="*30)
```

If this were fully implemented, the first query should ideally retrieve the chunk about Mercury, and the LLM should generate an answer like "Mercury is the planet closest to the Sun." The second query should retrieve the chunk about Mars and generate an answer mentioning its reddish appearance or the 'Red Planet' nickname.

## Visualizing the Flow

The process we just implemented can be visualized as follows:

```dot
digraph RAG_Pipeline {
    rankdir=LR;
    node [shape=box, style="filled", fillcolor="#a5d8ff", fontname="Arial"];
    edge [fontname="Arial"];

    UserQuery [label="User Query", fillcolor="#ffec99"];
    EmbedQuery [label="Embed Query\n(Embedding Model)", fillcolor="#bac8ff"];
    VectorStore [label="Search Vector Store\n(ChromaDB/FAISS)", shape=cylinder, fillcolor="#96f2d7"];
    RetrieveDocs [label="Retrieve Relevant\nDocument Chunks", fillcolor="#b2f2bb"];
    FormatPrompt [label="Format Prompt\n(Query + Context)", fillcolor="#ffd8a8"];
    LLM [label="Generate Response\n(LLM)", fillcolor="#fcc2d7"];
    FinalAnswer [label="Final Answer", fillcolor="#ffec99"];

    UserQuery -> EmbedQuery;
    EmbedQuery -> VectorStore [label="Query Embedding"];
    VectorStore -> RetrieveDocs [label="Similarity Search"];
    RetrieveDocs -> FormatPrompt [label="Top-k Chunks"];
    UserQuery -> FormatPrompt [label="Original Query"];
    FormatPrompt -> LLM [label="Augmented Prompt"];
    LLM -> FinalAnswer [label="Generated Text"];
}
```

*A diagram illustrating the sequence of operations in the basic RAG pipeline built in this practical exercise.*

## Summary and Next Steps

Congratulations! You've walked through the construction and execution of an end-to-end RAG pipeline. We connected data preparation, vector storage, retrieval, prompt engineering, and LLM generation.

This example is intentionally simple. Applications often involve more sophisticated data loading, advanced chunking strategies, potentially different embedding models or vector databases, more complex prompt templates, and handling of edge cases like queries with no relevant context found.

While functional, the quality and reliability of this basic system are yet to be determined. How do we know if the retriever is finding the best context? How do we assess if the LLM's generated answer is accurate and faithful to the retrieved information? These questions lead us directly into the next chapter, where we'll explore methods for evaluating RAG systems and strategies for improving their performance.