Now that we have explored the core ideas behind Retrieval-Augmented Generation, vector stores, and how libraries like LlamaIndex facilitate connecting LLMs with data, let's put these pieces together. This practice session guides you through building a basic RAG application using Python and LlamaIndex. We will ingest a small amount of text data, index it using embeddings, and then query it using an LLM, retrieving relevant context first.
Create a simple question-answering application that uses RAG to answer questions based on a provided set of text documents.
Before you start, ensure you have the necessary libraries installed. You'll primarily need LlamaIndex and an LLM provider library (like openai). You will also need the vector store component: the LlamaIndex FAISS integration is shipped as the separate llama-index-vector-stores-faiss package, and FAISS itself comes from the faiss-cpu package (or faiss-gpu if you have a compatible GPU and CUDA installed).
pip install llama-index llama-index-vector-stores-faiss openai faiss-cpu python-dotenv
Remember to set up your API keys securely, for instance, using environment variables and the python-dotenv library, as discussed in Chapter 2. For this example, we assume your OpenAI API key is accessible via an environment variable named OPENAI_API_KEY.
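If you haven't set that variable yet, a .env file in the project root is the simplest option. Here is a minimal sketch with a placeholder value (never commit real keys to version control):
# .env -- placeholder shown; replace with your actual key
OPENAI_API_KEY=your-api-key-here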
Begin by importing the necessary components from LlamaIndex and configuring your environment.
import os
import logging
import sys
from dotenv import load_dotenv
# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()
# Optional: Configure logging for visibility
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    Document,
)
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding # Or use other embeddings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI # Or use other LLMs
import faiss # Vector store library
# Check if the API key is available
if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY environment variable not set.")
print("Setup complete. Libraries imported and API key loaded.")
This code imports the core LlamaIndex classes, the FAISS vector store integration, the OpenAI embeddings and LLM classes, and checks for the necessary API key.
For a simple RAG system, we need some data to query. Instead of loading from files initially, let's define a few text snippets directly as LlamaIndex Document objects. This makes the example self-contained.
# Create sample Document objects
text1 = """
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model
released in 2020 that uses deep learning to produce human-like text.
Given an initial text as prompt, it will produce text that continues the prompt.
"""
text2 = """
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs)
by integrating external knowledge. Before generating a response, RAG models
retrieve relevant information from a predefined knowledge source, such as a
document collection or database. This retrieved context is then used to inform
and ground the generation process, leading to more accurate and factual answers.
"""
text3 = """
Vector embeddings represent text (or other data types) as numerical vectors
in a high-dimensional space. Similar concepts or texts are mapped to nearby points
in this space. This allows for efficient semantic search, where queries find
documents based on meaning rather than just keyword matching. These embeddings
are crucial for the retrieval step in RAG systems.
"""
documents = [
    Document(text=text1, doc_id="doc_gpt3"),
    Document(text=text2, doc_id="doc_rag"),
    Document(text=text3, doc_id="doc_embeddings"),
]
print(f"Created {len(documents)} sample documents.")
We've created three distinct text passages related to LLMs, RAG, and embeddings, wrapping each in a Document object. Assigning a doc_id is good practice for tracking provenance.
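If you would rather pull the text from files on disk, the SimpleDirectoryReader imported earlier can replace the inline snippets. The following is a minimal sketch, assuming a local ./data folder containing plain-text files exists:
# Sketch: load documents from a folder instead of defining them inline.
# Assumes a ./data directory with .txt (or other supported) files.
file_documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(file_documents)} documents from ./data")
Each loaded file becomes a Document, with basic file metadata attached for provenance.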
We need to specify which embedding model to use for converting text to vectors and which LLM to use for generating the final answer. LlamaIndex integrates with various providers; here, we use OpenAI.
# Initialize the embedding model
embed_model = OpenAIEmbedding()
# Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo") # Or choose another model like gpt-4
print("Initialized OpenAI embedding model and LLM.")
Now, let's create an instance of our chosen vector store, FAISS. We define the dimensionality of the vectors, which depends on the embedding model used (OpenAI's text-embedding-ada-002, the default for OpenAIEmbedding, produces 1536-dimensional vectors).
# Dimension of vectors for OpenAI ada-002
d = 1536
faiss_index = faiss.IndexFlatL2(d) # Using L2 distance for similarity
# Instantiate the FaissVectorStore
vector_store = FaissVectorStore(faiss_index=faiss_index)
print("FAISS vector store initialized.")
We create a basic FAISS index (IndexFlatL2) suitable for smaller datasets where exhaustive search is feasible. IndexFlatL2 calculates the L2 (Euclidean) distance between the query vector and all indexed vectors to find the nearest neighbors.
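To make that exhaustive-search behavior concrete, here is a small standalone sketch (independent of the RAG pipeline) that adds some random vectors to an IndexFlatL2 index and searches it:
import faiss
import numpy as np

# Standalone illustration: IndexFlatL2 compares the query against every stored
# vector, so results are exact but search cost grows linearly with index size.
demo_dim = 8                                          # tiny dimension for the demo
demo_index = faiss.IndexFlatL2(demo_dim)
demo_vectors = np.random.random((100, demo_dim)).astype("float32")
demo_index.add(demo_vectors)                          # store 100 vectors

demo_query = np.random.random((1, demo_dim)).astype("float32")
distances, ids = demo_index.search(demo_query, 3)     # 3 nearest neighbors
print("Nearest ids:", ids, "distances:", distances)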
With the documents, embedding model, and vector store ready, we can create the index. LlamaIndex handles the process of chunking the documents (if necessary), generating embeddings for each chunk, and storing them in the vector store. We'll also define a storage context to link the vector store.
# Define a storage context that uses our FAISS vector store
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Define a text splitter (optional but good practice)
# This helps break down larger documents if needed
node_parser = SentenceSplitter(chunk_size=100, chunk_overlap=20)
# Build the index
# This process involves:
# 1. Parsing documents into nodes (chunks)
# 2. Generating embeddings for each node using embed_model
# 3. Storing nodes and their embeddings in the vector_store
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[node_parser],  # use the defined splitter for chunking
)
print("Index created and data embedded into FAISS.")
# Optional: Persist the index to disk for later use
# index.storage_context.persist("./my_rag_index")
# print("Index persisted to disk.")
# Optional: Load index from disk if it exists
# if os.path.exists("./my_rag_index"):
#     vector_store = FaissVectorStore.from_persist_dir("./my_rag_index")
#     storage_context = StorageContext.from_defaults(
#         vector_store=vector_store, persist_dir="./my_rag_index"
#     )
#     index = load_index_from_storage(storage_context, embed_model=embed_model)
#     print("Index loaded from disk.")
# else:
#     print("Index not found on disk, creating a new one.")
#     # (Build the index as above, then persist it)
#     index.storage_context.persist("./my_rag_index")
Here, VectorStoreIndex.from_documents is the core function. It takes our list of Document objects, orchestrates the embedding generation via embed_model, and applies the node_parser (supplied through the transformations argument) to split the text into manageable chunks (nodes) before storing the results in the vector_store defined within the storage_context. The LLM is not needed at indexing time; we pass it to the query engine in the next step. We also show commented-out code for persisting and reloading the index, which is useful for larger datasets where indexing takes time.
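As an aside, recent LlamaIndex versions also let you register the embedding model and LLM once through the global Settings object instead of passing them into individual calls; a brief sketch:
from llama_index.core import Settings

# Register defaults once; later index and query-engine calls pick these up
# when no explicit embed_model/llm arguments are given.
Settings.embed_model = embed_model
Settings.llm = llm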
To interact with the indexed data, LlamaIndex provides query engines. A basic query engine retrieves relevant context from the index based on the query and then passes the query and context to the LLM for synthesis.
# Create a query engine from the index
# similarity_top_k=2 means retrieve the top 2 most similar nodes
query_engine = index.as_query_engine(similarity_top_k=2, llm=llm)
print("Query engine created.")
as_query_engine() is a convenient method on the index object. We specify similarity_top_k=2 to retrieve the two most relevant text chunks (nodes) from our vector store for each query. The llm instance is passed here to be used for the final answer generation step.
Finally, let's ask a question related to our indexed documents.
# Define a query
query_text = "How does RAG improve LLM responses?"
# Execute the query
response = query_engine.query(query_text)
# Print the response
print("\nQuery:", query_text)
print("\nResponse:")
print(response) # The synthesized answer from the LLM
# Optional: Inspect the retrieved source nodes
# print("\nSource Nodes:")
# for node in response.source_nodes:
# print(f" Score: {node.score:.4f}")
# print(f" Content: {node.get_content().strip()}")
# print("-" * 20)
The query_engine.query() method performs the full RAG process:
1. Generates an embedding for the query_text using the embed_model.
2. Searches the vector_store (FAISS) for the similarity_top_k nodes whose embeddings are closest to the query embedding.
3. Constructs a prompt combining the original query_text and the content of the retrieved nodes.
4. Sends this augmented prompt to the llm to synthesize the final answer.
When you run the full script, you should see output similar to this (the exact wording of the LLM response might vary slightly):
Setup complete. Libraries imported and API key loaded.
Created 3 sample documents.
Initialized OpenAI embedding model and LLM.
FAISS vector store initialized.
Index created and data embedded into FAISS.
Query engine created.
Query: How does RAG improve LLM responses?
Response:
Retrieval-Augmented Generation (RAG) improves LLM responses by integrating external knowledge. Before generating a response, RAG models retrieve relevant information from a knowledge source like documents or databases. This retrieved context grounds the generation process, leading to more accurate and factual answers.
If you uncomment the code to print source nodes, you'll see the specific text chunks retrieved from the documents (likely the content from doc_rag and possibly doc_embeddings, depending on similarity scores) that the LLM used to formulate its answer.
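If you want to examine the retrieval step in isolation, without the LLM synthesizing an answer, you can also build a standalone retriever from the same index; a short sketch:
# Sketch: run retrieval only, without answer synthesis.
retriever = index.as_retriever(similarity_top_k=2)
retrieved = retriever.retrieve(query_text)
for node_with_score in retrieved:
    print(f"Score: {node_with_score.score:.4f}")
    print(node_with_score.get_content().strip()[:150])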
In this practice session, you successfully built a basic RAG pipeline using Python, LlamaIndex, OpenAI, and FAISS. You ingested text data, created vector embeddings, stored them in a vector database, and used a query engine to retrieve relevant context and generate an informed answer from an LLM. This demonstrates the core workflow of RAG: retrieve relevant information first, then generate the response based on that information. You can adapt this pattern by changing the data source, embedding models, vector stores, or LLMs to build more sophisticated RAG applications.