Alright, let's bring together the concepts and components we've discussed to build and run a complete, albeit basic, Retrieval-Augmented Generation (RAG) system. This practical exercise demonstrates how the retriever and generator work in concert to answer queries using external knowledge. We'll simulate the steps using common patterns and libraries, focusing on the flow rather than exhaustive implementation details.
Assume you have your Python environment set up with the necessary libraries installed, as covered in the "Setting up the Environment" section. This typically includes a library for vector embeddings (like sentence-transformers), a vector database client (like chromadb or faiss-cpu), and an LLM interface (like openai or huggingface_hub). For simplicity, we might illustrate ideas using pseudocode or high-level framework calls (like those found in LangChain or LlamaIndex).
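If you are starting from scratch, the installation step might look roughly like this. The package names are assumptions based on the libraries mentioned above; adjust them to whichever stack you actually use:
# Illustrative installs for this walkthrough (package names are assumptions;
# swap in faiss-cpu or huggingface_hub if that matches your setup):
# pip install sentence-transformers chromadb openai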
First, we need a small knowledge base. Let's imagine we have a few text snippets about different planets in our solar system. In a real application, you would load these from files (as discussed in Chapter 3), but here we'll define them directly:
# Sample documents representing our knowledge base
documents = [
    "Mercury is the smallest planet in our Solar System and closest to the Sun.",
    "Venus has a thick, toxic atmosphere filled with carbon dioxide and is perpetually shrouded in thick, yellowish clouds of sulfuric acid.",
    "Earth is the third planet from the Sun and the only astronomical object known to harbor life.",
    "Mars is often called the 'Red Planet' because of its reddish appearance, due to iron oxide prevalent on its surface."
]
# In a real scenario, we'd apply chunking strategies here.
# For simplicity, we'll treat each sentence as a 'chunk'.
# Assume metadata like {'source': 'solar_system_facts.txt', 'doc_id': i} is associated with each.
document_chunks = documents
Here, we're keeping it simple by treating each sentence as a document chunk. Remember from Chapter 3 that effective chunking is important for larger documents.
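To make that concrete, here is a minimal sketch of a fixed-size character chunker. The chunk size and overlap values are arbitrary placeholders, and real pipelines often split on sentence or paragraph boundaries instead:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks (naive strategy).

    Assumes overlap < chunk_size so the loop always advances.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Example usage with a hypothetical larger file:
# long_document = open("solar_system_facts.txt").read()
# document_chunks = chunk_text(long_document)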
Next, we initialize the main building blocks: the embedding model, the vector store, and the LLM.
# 1. Embedding Model
# Load a pre-trained model to generate embeddings
# (e.g., from the sentence-transformers library)
# from sentence_transformers import SentenceTransformer
# embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Vector Store (In-Memory ChromaDB example)
# import chromadb
# client = chromadb.Client()  # In-memory client
# vector_store = client.create_collection("solar_system")

# 3. Large Language Model
# Initialize a connection to an LLM
# (e.g., using OpenAI's library or Hugging Face's transformers)
# from openai import OpenAI
# llm = OpenAI(api_key="YOUR_API_KEY")  # Or load a local model
These steps set up the tools needed: one to convert text to vectors, one to store and search those vectors efficiently, and one to generate text based on input prompts.
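If you want to run this setup for real rather than as pseudocode, a minimal version might look like the sketch below. It assumes sentence-transformers and chromadb are installed and that your OpenAI API key is available in the OPENAI_API_KEY environment variable; any other embedding model, vector store, or LLM client can be swapped in:

import os
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

# Text-to-vector model: small and fast, good enough for a demo
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# In-memory vector store; data is lost when the process exits
client = chromadb.Client()
vector_store = client.create_collection("solar_system")

# Hosted LLM client; a locally served model would also work
llm_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])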
Now, we process our document chunks and store them in the vector store. This involves generating an embedding for each chunk and adding the chunk text, its embedding, and any associated metadata to the vector database.
# (Continuing Example)
# Generate embeddings for each chunk
# chunk_embeddings = embedding_model.encode(document_chunks)

# Prepare data for indexing (text, embeddings, metadata)
# ids = [f"chunk_{i}" for i in range(len(document_chunks))]
# metadatas = [{'source': 'solar_system_facts.txt', 'doc_id': i} for i in range(len(document_chunks))]

# Add to the vector store
# vector_store.add(
#     embeddings=chunk_embeddings.tolist(),  # convert the NumPy array to plain Python lists
#     documents=document_chunks,
#     metadatas=metadatas,
#     ids=ids
# )
print("Data successfully indexed in the vector store.")
After this step, our knowledge base is ready to be queried. The vector store can now perform similarity searches to find chunks relevant to a user's question.
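Under the hood, "similarity" here typically means cosine similarity (or a closely related distance) between the query embedding and each stored chunk embedding. As a rough illustration of what the vector store computes, assuming the embeddings from the earlier steps are available as NumPy arrays:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two vectors: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Conceptually, the vector store ranks chunks by a score like this:
# scores = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
# top_two = [document_chunks[i] for i in np.argsort(scores)[::-1][:2]]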
We now define the core logic that takes a user query, retrieves relevant context, and generates an answer using the LLM. This can be encapsulated in a function or, more commonly when using frameworks, defined as a chain or sequence.
# (Example)
def answer_query_with_rag(query: str):
    # 1. Embed the query
    # query_embedding = embedding_model.encode([query])[0]

    # 2. Retrieve relevant chunks
    # Search the vector store for the top_k most similar chunks
    # retrieved_results = vector_store.query(
    #     query_embeddings=[query_embedding],
    #     n_results=2  # Retrieve top 2 relevant chunks
    # )
    # retrieved_docs = retrieved_results['documents'][0]
    # print(f"Retrieved Documents: {retrieved_docs}")  # Optional: inspect retrieved context

    # 3. Construct the prompt
    # Combine the original query with the retrieved context
    # context_string = "\n\n".join(retrieved_docs)
    # prompt_template = f"""
    # Based on the following context, answer the query.
    # If the context doesn't contain the answer, state that.
    # Context:
    # {context_string}
    # Query: {query}
    # Answer:
    # """
    # print(f"Generated Prompt:\n{prompt_template}")  # Optional: inspect the prompt

    # 4. Generate the response using the LLM
    # response = llm.complete(prompt=prompt_template)  # Or equivalent LLM API call
    # generated_answer = response  # Process the response object as needed

    # For demonstration, we'll return placeholder values
    generated_answer = "This is a placeholder answer based on retrieved context."
    retrieved_docs = ["Document 1 text...", "Document 2 text..."]
    return generated_answer, retrieved_docs
This function outlines the RAG flow: embed the query, search the vector store, use the results to build a detailed prompt, and finally, call the LLM with this augmented prompt.
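The llm.complete(...) line above is a stand-in for whatever generation API you use. With OpenAI's Python client, for instance, step 4 might look roughly like this, assuming the llm_client from the setup sketch earlier and an example model name:

# response = llm_client.chat.completions.create(
#     model="gpt-4o-mini",  # example model name; use whichever model you have access to
#     messages=[{"role": "user", "content": prompt_template}],
#     temperature=0,        # keep the answer grounded in the provided context
# )
# generated_answer = response.choices[0].message.content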
Let's test our basic RAG system with a query related to our indexed data.
# Example Query
user_query = "Which planet is closest to the sun?"
# Execute the RAG process
final_answer, retrieved_context = answer_query_with_rag(user_query)
print("\n" + "="*30)
print(f"User Query: {user_query}")
# In a real run, retrieved_context would contain the actual document text
print(f"Retrieved Context: {retrieved_context}")
print(f"Generated Answer: {final_answer}")
print("="*30 + "\n")
# Another Example Query
user_query_2 = "What is Mars known for?"
final_answer_2, retrieved_context_2 = answer_query_with_rag(user_query_2)
print("="*30)
print(f"User Query: {user_query_2}")
print(f"Retrieved Context: {retrieved_context_2}")
print(f"Generated Answer: {final_answer_2}")
print("="*30)
If this were fully implemented, the first query should ideally retrieve the chunk about Mercury, and the LLM should generate an answer like "Mercury is the planet closest to the Sun." The second query should retrieve the chunk about Mars and generate an answer mentioning its reddish appearance or the 'Red Planet' nickname.
The process we just implemented can be visualized as follows:
(Diagram: the sequence of operations in the basic RAG pipeline built in this practical exercise, from indexing the documents through retrieval, prompt construction, and generation.)
Congratulations! You've walked through the construction and execution of an end-to-end RAG pipeline. We connected data preparation, vector storage, retrieval, prompt engineering, and LLM generation.
This example is intentionally simple. Real-world applications often involve more sophisticated data loading, advanced chunking strategies, potentially different embedding models or vector databases, more complex prompt templates, and handling of edge cases like queries with no relevant context found.
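As one example of edge-case handling, you could inspect the retrieval distances before calling the LLM and fall back to an "I don't know" style answer when nothing is close enough. This sketch assumes Chroma's default distance metric (smaller means more similar) and an arbitrary threshold you would tune for your data; it would sit inside answer_query_with_rag before the prompt is built:

# retrieved_results = vector_store.query(query_embeddings=[query_embedding], n_results=2)
# distances = retrieved_results["distances"][0]
# MAX_DISTANCE = 1.0  # illustrative cutoff; tune for your embedding model and metric
# if not distances or min(distances) > MAX_DISTANCE:
#     return "I couldn't find relevant information in the knowledge base for that query.", []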
While functional, the quality and reliability of this basic system are yet to be determined. How do we know if the retriever is finding the best context? How do we assess if the LLM's generated answer is accurate and faithful to the retrieved information? These questions lead us directly into the next chapter, where we'll explore methods for evaluating RAG systems and strategies for improving their performance.