Now that we've covered the individual components of Retrieval Augmented Generation (RAG), let's put them together to build a functional Question-Answering system. This practical exercise will guide you through creating a Python application that can answer questions based on the content of a provided text document. We will use common libraries to handle document processing, embeddings, vector storage, and interaction with an LLM API.
Prerequisites:
Before starting, ensure you have the necessary libraries installed. You'll typically need:
- An LLM client library (e.g., openai for OpenAI models, anthropic for Anthropic models).
- An embedding library (e.g., sentence-transformers, or the LLM client library itself if it provides embeddings).
- A vector store library (e.g., faiss-cpu for a local FAISS index, or chromadb).
- Optionally, a framework such as langchain for utilities like text splitting. For this example, we'll assume access to text splitting functions.

You will also need an API key for your chosen LLM provider. Remember to handle your API keys securely, for instance, by using environment variables.
# Example environment setup (conceptual)
import os
# Load API key from environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
# os.environ["ANTHROPIC_API_KEY"] = "YOUR_API_KEY_HERE"
Step 1: Prepare Your Document
First, we need a document to query. Let's create a simple text file named my_document.txt with some content about Python dictionaries:
# Contents of my_document.txt
Python dictionaries are versatile data structures used for storing key-value pairs.
Keys must be immutable types like strings, numbers, or tuples (if they contain only immutable elements).
Values can be of any data type. Dictionaries are unordered in older Python versions (before 3.7),
but maintain insertion order in modern Python.

You can create a dictionary using curly braces {} or the dict() constructor.
Accessing values is done using square bracket notation with the key, e.g., my_dict['my_key'].
If a key is not found, a KeyError is raised. The .get() method can be used for safer access,
allowing a default value to be returned if the key is absent.

Common dictionary methods include .keys(), .values(), and .items() to retrieve views of keys,
values, and key-value tuples, respectively. The 'in' keyword checks for key existence.
Dictionaries are mutable, meaning you can add, remove, or update key-value pairs after creation.
Now, let's load and split this document in Python. We need to break it into smaller chunks that can be effectively embedded and retrieved.
# Assume a text splitter function is available
# e.g., from langchain.text_splitter import RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
def load_and_split_document(file_path):
    """Loads a text file and splits it into chunks."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        # Placeholder for a text splitting mechanism
        # In a real scenario, use a library like LangChain's text splitters
        # For simplicity, we split by paragraphs here. Adjust as needed.
        chunks = [p.strip() for p in text.split('\n\n') if p.strip()]
        print(f"Loaded and split document into {len(chunks)} chunks.")
        return chunks
    except FileNotFoundError:
        print(f"Error: Document not found at {file_path}")
        return []
document_chunks = load_and_split_document('my_document.txt')
# Example output (depending on the splitting logic):
# ['Python dictionaries are versatile data structures...',
# 'You can create a dictionary using curly braces {}...',
# 'Common dictionary methods include .keys(), .values()...']
Chunking strategy is important. Splitting by paragraphs might be too coarse for dense documents. Libraries like LangChain provide more sophisticated splitters (e.g., RecursiveCharacterTextSplitter) that consider sentence structure or fixed character lengths with overlap, often leading to better retrieval results, as the sketch below illustrates.
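As a rough sketch, assuming LangChain's text splitters are installed (the exact import path differs between LangChain versions), the paragraph-based splitting above could be swapped for:

# Requires: pip install langchain-text-splitters (older releases: pip install langchain)
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Older versions: from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # maximum characters per chunk
    chunk_overlap=20,  # characters of overlap between consecutive chunks
)
chunks = text_splitter.split_text(text)  # returns a list of chunk strings

The overlap ensures that a sentence cut at a chunk boundary still appears intact in at least one chunk.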
Step 2: Generate Embeddings and Create a Vector Index
Next, we convert our text chunks into numerical representations (embeddings) that capture their semantic meaning. We then store these embeddings in a vector store for efficient searching.
We'll use the sentence-transformers library for embeddings and faiss-cpu for a simple, local vector store.
# Requires: pip install sentence-transformers faiss-cpu
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# 1. Load an embedding model
# Common choice: 'all-MiniLM-L6-v2' - fast and reasonably good
model_name = 'all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)
print(f"Loaded embedding model: {model_name}")
# 2. Generate embeddings for each chunk
if document_chunks:
    chunk_embeddings = embedding_model.encode(document_chunks, convert_to_numpy=True)
    print(f"Generated embeddings of shape: {chunk_embeddings.shape}")  # (num_chunks, embedding_dimension)

    # 3. Create a FAISS index
    embedding_dimension = chunk_embeddings.shape[1]
    # IndexFlatL2 uses Euclidean distance for similarity
    index = faiss.IndexFlatL2(embedding_dimension)

    # 4. Add embeddings to the index
    index.add(chunk_embeddings)
    print(f"Created FAISS index with {index.ntotal} vectors.")
else:
    print("No document chunks to process.")
    index = None  # Ensure index is None if no data
At this point, our index (the FAISS vector store) holds the numerical representations of our document chunks, ready for searching. We also need to keep the original document_chunks list handy, as the index only stores vectors, not the text itself. The index allows us to quickly find vectors (and thus, corresponding text chunks) that are closest to a query vector.
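Because FAISS stores only vectors, the mapping from index positions back to text must be kept alongside it. A minimal persistence sketch (the file names here are illustrative) saves both pieces so they can be reloaded later without re-embedding:

import json
import faiss

# Position i in the index corresponds to document_chunks[i].
faiss.write_index(index, "my_document.index")
with open("my_document_chunks.json", "w", encoding="utf-8") as f:
    json.dump(document_chunks, f)

# Later: reload the index and the chunk texts together.
index = faiss.read_index("my_document.index")
with open("my_document_chunks.json", "r", encoding="utf-8") as f:
    document_chunks = json.load(f)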
Step 3: Retrieve Relevant Chunks
When a user asks a question, we first embed their query using the same embedding model. Then, we use the vector store (our FAISS index) to find the embeddings (and corresponding text chunks) most similar to the query embedding.
def retrieve_relevant_chunks(query, index, embedding_model, chunks, top_k=3):
    """Embeds the query and retrieves the top_k most relevant chunks."""
    if index is None or index.ntotal == 0:
        print("Vector index is not initialized or is empty.")
        return []
    if not chunks:
        print("No document chunks available.")
        return []

    # 1. Embed the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)

    # 2. Search the index
    # D: distances, I: indices of the nearest neighbors
    distances, indices = index.search(query_embedding, top_k)

    # 3. Get the corresponding text chunks
    relevant_chunks = [chunks[i] for i in indices[0]]  # indices[0] because we searched for one query
    print(f"Retrieved {len(relevant_chunks)} relevant chunks for query: '{query}'")
    return relevant_chunks
# Example usage
user_query = "How do I safely access dictionary values in Python?"
if index:
    retrieved_chunks = retrieve_relevant_chunks(user_query, index, embedding_model, document_chunks)
    # Example Output:
    # Retrieved 3 relevant chunks for query: 'How do I safely access dictionary values in Python?'
    # ['You can create a dictionary using curly braces {}...',
    #  'Common dictionary methods include .keys(), .values()...',
    #  'Python dictionaries are versatile data structures...']  # Order depends on similarity
else:
    retrieved_chunks = []
The top_k parameter determines how many chunks to retrieve. Choosing the right k is a balance: too few chunks might miss necessary context, while too many might dilute the prompt or exceed context window limits.
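One simple refinement, sketched below under the assumption of the IndexFlatL2 index built earlier, is to over-retrieve and then discard results whose L2 distance exceeds a threshold; the threshold value here is arbitrary and would need tuning for your embedding model:

def retrieve_with_threshold(query, index, embedding_model, chunks, top_k=5, max_distance=1.2):
    """Retrieve up to top_k chunks, keeping only those within max_distance of the query."""
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, top_k)
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        # FAISS returns -1 when fewer than top_k vectors exist in the index.
        if idx != -1 and dist <= max_distance:
            results.append(chunks[idx])
    return results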
Step 4: Augment Prompt and Generate Answer
Now we combine the retrieved chunks (our context) with the original user query to form a detailed prompt for the LLM. This prompt instructs the LLM to answer the query based only on the provided context.
# Assume 'llm_client' is an initialized client for your chosen LLM API
# e.g., from openai import OpenAI; llm_client = OpenAI()
# or from anthropic import Anthropic; llm_client = Anthropic()
def generate_answer(query, retrieved_chunks, llm_client):
    """Constructs the prompt and generates an answer using the LLM."""
    if not retrieved_chunks:
        return "I couldn't find relevant information in the document to answer your question."

    # 1. Construct the context string
    context = "\n\n".join(retrieved_chunks)

    # 2. Create the prompt
    prompt = f"""
Based solely on the following context, please answer the question.
If the context does not contain the answer, say "I cannot answer this question based on the provided context."
Context:
{context}
Question: {query}
Answer:
"""
    print("\n--- Sending Prompt to LLM ---")
    # print(prompt)  # Optionally print the full prompt for debugging
    print("--- End of Prompt ---")

    try:
        # 3. Call the LLM API (Example using OpenAI's chat completion)
        # Adjust parameters (model, max_tokens, temperature) as needed
        response = llm_client.chat.completions.create(
            model="gpt-3.5-turbo",  # Or another suitable model
            messages=[
                {"role": "system", "content": "You are a helpful assistant answering questions based on provided context."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,  # Lower temperature for more factual answers
            max_tokens=150
        )
        answer = response.choices[0].message.content.strip()
        return answer
    except Exception as e:
        print(f"Error calling LLM API: {e}")
        return "Sorry, I encountered an error while generating the answer."
# Example usage (assuming llm_client is configured)
# final_answer = generate_answer(user_query, retrieved_chunks, llm_client)
# print(f"\nAnswer:\n{final_answer}")
# Example Expected Output:
# --- Sending Prompt to LLM ---
# --- End of Prompt ---
#
# Answer:
# To safely access dictionary values in Python and avoid a KeyError if the key is not found, you can use the .get() method. This method allows you to provide a default value that will be returned if the key is absent. For example, `my_dict.get('non_existent_key', 'default_value')`.
This prompt structure clearly separates the instruction, the context derived from your document, and the user's question. Instructing the LLM to rely only on the provided context is what grounds the response in the document's information rather than the model's general training data.
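If you are working with Anthropic instead of OpenAI, the same augmented prompt can be sent through the Messages API. A rough sketch (the model name is illustrative; check the current model list):

from anthropic import Anthropic  # requires: pip install anthropic

anthropic_client = Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",  # illustrative choice; substitute a current Claude model
    max_tokens=150,
    system="You are a helpful assistant answering questions based on provided context.",
    messages=[{"role": "user", "content": prompt}],
)
answer = response.content[0].text.strip()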
Putting It All Together: A Basic RAG Pipeline
Let's combine these steps into a single script. Remember to replace placeholders for API keys and potentially adjust model names or library usage based on your specific setup.
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI # Or your preferred LLM client
# --- Configuration ---
DOCUMENT_PATH = 'my_document.txt'
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'
VECTOR_STORE_DIMENSION = 384 # Dimension for 'all-MiniLM-L6-v2'
TOP_K_RESULTS = 3
LLM_MODEL = "gpt-3.5-turbo" # Or another capable model
# --- Setup ---
# Securely load API Key (replace with your method)
# client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
client = OpenAI() # Assumes OPENAI_API_KEY is set in environment
# --- Functions (from previous steps) ---
def load_and_split_document(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
        chunks = [p.strip() for p in text.split('\n\n') if p.strip()]
        print(f"Loaded and split document into {len(chunks)} chunks.")
        return chunks
    except FileNotFoundError:
        print(f"Error: Document not found at {file_path}")
        return []
def setup_rag_components(chunks, model_name, dimension):
    if not chunks:
        return None, None
    try:
        embedding_model = SentenceTransformer(model_name)
        chunk_embeddings = embedding_model.encode(chunks, convert_to_numpy=True)
        index = faiss.IndexFlatL2(dimension)
        index.add(chunk_embeddings)
        print(f"Initialized FAISS index with {index.ntotal} vectors.")
        return embedding_model, index
    except Exception as e:
        print(f"Error initializing RAG components: {e}")
        return None, None
def retrieve_relevant_chunks(query, index, embedding_model, chunks, top_k):
    if index is None or index.ntotal == 0 or not chunks:
        print("Vector index or chunks not available for retrieval.")
        return []
    try:
        query_embedding = embedding_model.encode([query], convert_to_numpy=True)
        distances, indices = index.search(query_embedding, top_k)
        relevant_chunks = [chunks[i] for i in indices[0]]
        print(f"Retrieved {len(relevant_chunks)} relevant chunks.")
        return relevant_chunks
    except Exception as e:
        print(f"Error during retrieval: {e}")
        return []
def generate_answer(query, retrieved_chunks, llm_client, llm_model):
    if not retrieved_chunks:
        return "I couldn't find relevant information in the document to answer your question."
    context = "\n\n".join(retrieved_chunks)
    prompt = f"""
Based solely on the following context, please answer the question.
If the context does not contain the answer, say "I cannot answer this question based on the provided context."
Context:
{context}
Question: {query}
Answer:
"""
    try:
        response = llm_client.chat.completions.create(
            model=llm_model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant answering questions based on provided context."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2,
            max_tokens=150
        )
        answer = response.choices[0].message.content.strip()
        return answer
    except Exception as e:
        print(f"Error calling LLM API: {e}")
        return "Sorry, I encountered an error while generating the answer."
# --- Main Execution ---
if __name__ == "__main__":
# 1. Load and Prepare Document
document_chunks = load_and_split_document(DOCUMENT_PATH)
# 2. Setup Embeddings and Vector Store
embedding_model, vector_index = setup_rag_components(
document_chunks, EMBEDDING_MODEL_NAME, VECTOR_STORE_DIMENSION
)
if embedding_model and vector_index:
# 3. Get User Query
user_query = input("Please enter your question about the document: ")
# 4. Retrieve Relevant Context
retrieved_context = retrieve_relevant_chunks(
user_query, vector_index, embedding_model, document_chunks, TOP_K_RESULTS
)
# 5. Generate Answer
final_answer = generate_answer(
user_query, retrieved_context, client, LLM_MODEL
)
print("\n--- Final Answer ---")
print(final_answer)
else:
print("Failed to set up RAG components. Exiting.")
Basic workflow of the RAG system built in this practical exercise.
Conclusion
You have successfully built a basic Retrieval Augmented Generation system! This application demonstrates how to connect an LLM with external knowledge contained in a document. By retrieving relevant text chunks based on semantic similarity and providing them as context, the LLM can generate answers grounded in specific information, overcoming the limitations of its static training data.
This is a foundational example; real-world RAG systems often layer on more sophisticated techniques for chunking, retrieval ranking, and prompt construction.
Experiment with different documents, questions, embedding models, and LLM parameters to understand their impact on the results. This practical foundation prepares you for building more complex and effective RAG applications.