Now that we have explored the core ideas behind Retrieval-Augmented Generation, vector stores, and how libraries like LlamaIndex connect LLMs with data, let's put these pieces together. This practice session guides you through building a basic RAG application using Python and LlamaIndex. We will ingest a small amount of text data, index it using embeddings, and then query it with an LLM, retrieving relevant context first.

## Objective

Create a simple question-answering application that uses RAG to answer questions based on a provided set of text documents.

## Prerequisites

Before you start, ensure you have the necessary libraries installed. You'll primarily need LlamaIndex, its FAISS vector store integration, and an LLM provider library (such as openai). You will also need the vector store library itself; we'll use FAISS here, which requires the faiss-cpu package (or faiss-gpu if you have a compatible GPU and CUDA installed).

```bash
pip install llama-index llama-index-vector-stores-faiss openai faiss-cpu python-dotenv
```

Remember to set up your API keys securely, for instance using environment variables and the python-dotenv library, as discussed in Chapter 2. For this example, we assume your OpenAI API key is accessible via an environment variable named OPENAI_API_KEY.

## Step 1: Setup and Imports

Begin by importing the necessary components from LlamaIndex and configuring your environment.

```python
import os
import logging
import sys

from dotenv import load_dotenv

# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()

# Optional: configure logging for visibility
# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    Document,
)
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding  # Or use other embeddings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI  # Or use other LLMs

import faiss  # Vector store library

# Check that the API key is available
if os.getenv("OPENAI_API_KEY") is None:
    raise ValueError("OPENAI_API_KEY environment variable not set.")

print("Setup complete. Libraries imported and API key loaded.")
```

This code imports the core LlamaIndex classes, the FAISS vector store integration, the OpenAI embedding and LLM classes, and checks for the necessary API key.

## Step 2: Prepare Sample Data

For a simple RAG system, we need some data to query. Instead of loading from files initially, let's define a few text snippets directly as LlamaIndex Document objects. This makes the example self-contained.

```python
# Create sample Document objects
text1 = """
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model
released in 2020 that uses deep learning to produce human-like text. Given an
initial text as prompt, it will produce text that continues the prompt.
"""

text2 = """
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by
integrating external knowledge. Before generating a response, RAG models retrieve
relevant information from a predefined knowledge source, such as a document
collection or database. This retrieved context is then used to inform and ground
the generation process, leading to more accurate and factual answers.
"""

text3 = """
Vector embeddings represent text (or other data types) as numerical vectors in a
high-dimensional space. Similar concepts or texts are mapped to nearby points in
this space. This allows for efficient semantic search, where queries find documents
based on meaning rather than just keyword matching. These embeddings are important
for the retrieval step in RAG systems.
"""

documents = [
    Document(text=text1, doc_id="doc_gpt3"),
    Document(text=text2, doc_id="doc_rag"),
    Document(text=text3, doc_id="doc_embeddings"),
]

print(f"Created {len(documents)} sample documents.")
```

We've created three distinct text passages related to LLMs, RAG, and embeddings, wrapping each in a Document object. Assigning a doc_id is good practice for tracking provenance.
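If you would rather work with files on disk, the SimpleDirectoryReader imported in Step 1 can load them for you. The following is a minimal sketch, assuming a local ./data directory (the directory name is illustrative):

```python
# Alternative to the inline snippets above: load documents from a local folder.
# Assumes a ./data directory exists; SimpleDirectoryReader picks a parser per file type.
from llama_index.core import SimpleDirectoryReader

file_documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(file_documents)} documents from ./data")
```

Either list of Document objects can be passed unchanged to the indexing step later on.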
## Step 3: Initialize Embeddings and LLM

We need to specify which embedding model to use for converting text to vectors and which LLM to use for generating the final answer. LlamaIndex integrates with various providers; here, we use OpenAI.

```python
# Initialize the embedding model
embed_model = OpenAIEmbedding()

# Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo")  # Or choose another model, such as gpt-4

print("Initialized OpenAI embedding model and LLM.")
```

## Step 4: Set Up the Vector Store

Now, let's create an instance of our chosen vector store, FAISS. We define the dimensionality of the vectors, which depends on the embedding model used (OpenAI's text-embedding-ada-002, the default for OpenAIEmbedding, produces 1536-dimensional vectors).

```python
# Dimension of vectors for OpenAI ada-002
d = 1536
faiss_index = faiss.IndexFlatL2(d)  # Using L2 distance for similarity

# Instantiate the FaissVectorStore
vector_store = FaissVectorStore(faiss_index=faiss_index)

print("FAISS vector store initialized.")
```

We create a basic FAISS index (IndexFlatL2) suitable for smaller datasets where exhaustive search is feasible. IndexFlatL2 computes the L2 (Euclidean) distance between the query vector and every indexed vector to find the nearest neighbors.
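To make that exhaustive-search behaviour concrete, here is a tiny standalone FAISS example, separate from the RAG pipeline and using small random vectors rather than real embeddings:

```python
import numpy as np
import faiss

# Toy illustration of IndexFlatL2, independent of the RAG pipeline:
# store four random 8-dimensional vectors and find the two closest to a query.
d_demo = 8
rng = np.random.default_rng(42)
stored = rng.random((4, d_demo), dtype=np.float32)

demo_index = faiss.IndexFlatL2(d_demo)
demo_index.add(stored)  # "flat" index: vectors are simply kept in memory

query = rng.random((1, d_demo), dtype=np.float32)
distances, ids = demo_index.search(query, 2)  # brute-force L2 comparison against all stored vectors

print("Nearest ids:", ids[0])          # positions of the two closest stored vectors
print("L2 distances:", distances[0])   # smaller distance = more similar
```

In the real pipeline, FaissVectorStore issues the add and search calls for us, using 1536-dimensional OpenAI embeddings instead of these toy vectors.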
## Step 5: Create the Index

With the documents, embedding model, and vector store ready, we can create the index. LlamaIndex handles chunking the documents (if necessary), generating embeddings for each chunk, and storing them in the vector store. We'll also define a storage context to link the vector store.

```python
# Define a storage context that uses our FAISS vector store
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Define a text splitter (optional but good practice)
# This helps break down larger documents if needed
node_parser = SentenceSplitter(chunk_size=100, chunk_overlap=20)

# Build the index. This process involves:
# 1. Parsing documents into nodes (chunks)
# 2. Generating embeddings for each node using embed_model
# 3. Storing nodes and their embeddings in the vector_store
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[node_parser],  # Use the defined splitter for chunking
)

print("Index created and data embedded into FAISS.")

# Optional: persist the index to disk for later use
# index.storage_context.persist("./my_rag_index")
# print("Index persisted to disk.")

# Optional: load the index from disk if it exists
# try:
#     persisted_store = FaissVectorStore.from_persist_dir("./my_rag_index")
#     storage_context = StorageContext.from_defaults(
#         vector_store=persisted_store, persist_dir="./my_rag_index"
#     )
#     index = load_index_from_storage(storage_context, embed_model=embed_model)
#     print("Index loaded from disk.")
# except FileNotFoundError:
#     print("Index not found on disk, creating a new one.")
#     # (Build the index as above, then persist it)
#     # index.storage_context.persist("./my_rag_index")
```

Here, VectorStoreIndex.from_documents is the core call. It takes our list of Document objects, uses the node_parser passed via transformations to split the text into manageable chunks (nodes), orchestrates embedding generation via embed_model, and stores the results in the vector_store defined within the storage_context. The LLM is not needed at index time for this basic setup; it comes into play at query time. We also show commented-out code for persisting and reloading the index, which is useful for larger datasets where indexing takes time.

## Step 6: Create a Query Engine

To interact with the indexed data, LlamaIndex provides query engines. A basic query engine retrieves relevant context from the index based on the query and then passes the query and context to the LLM for synthesis.

```python
# Create a query engine from the index
# similarity_top_k=2 means retrieve the top 2 most similar nodes
query_engine = index.as_query_engine(similarity_top_k=2, llm=llm)

print("Query engine created.")
```

as_query_engine() is a convenience method on the index object. We specify similarity_top_k=2 to retrieve the two most relevant text chunks (nodes) from our vector store for each query. The llm instance is passed here to be used for the final answer generation step.

## Step 7: Query the Data

Finally, let's ask a question related to our indexed documents.

```python
# Define a query
query_text = "How does RAG improve LLM responses?"

# Execute the query
response = query_engine.query(query_text)

# Print the response
print("\nQuery:", query_text)
print("\nResponse:")
print(response)  # The synthesized answer from the LLM

# Optional: inspect the retrieved source nodes
# print("\nSource Nodes:")
# for node in response.source_nodes:
#     print(f"  Score: {node.score:.4f}")
#     print(f"  Content: {node.get_content().strip()}")
#     print("-" * 20)
```

The query_engine.query() method performs the full RAG process:

1. It takes the query_text.
2. Generates an embedding for the query using embed_model.
3. Searches the vector_store (FAISS) for the similarity_top_k nodes whose embeddings are closest to the query embedding.
4. Constructs a new prompt containing the original query_text and the content of the retrieved nodes.
5. Sends this augmented prompt to the llm.
6. Returns the LLM's generated response.
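If you want to inspect steps 2 and 3 in isolation, without any answer synthesis, the index also exposes a bare retriever. A short optional sketch:

```python
# Optional: run only the retrieval step, with no LLM synthesis,
# to see which chunks the query engine would hand to the LLM.
retriever = index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve("How does RAG improve LLM responses?")

for node in retrieved_nodes:
    print(f"Score: {node.score:.4f}")
    print(node.get_content().strip())
    print("-" * 20)
```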
## Expected Outcome

When you run the full script, you should see output similar to this (the exact wording of the LLM response might vary slightly):

```text
Setup complete. Libraries imported and API key loaded.
Created 3 sample documents.
Initialized OpenAI embedding model and LLM.
FAISS vector store initialized.
Index created and data embedded into FAISS.
Query engine created.

Query: How does RAG improve LLM responses?

Response:
Retrieval-Augmented Generation (RAG) improves LLM responses by integrating external
knowledge. Before generating a response, RAG models retrieve relevant information
from a knowledge source like documents or databases. This retrieved context grounds
the generation process, leading to more accurate and factual answers.
```

If you uncomment the code that prints source nodes, you'll see the specific text chunks retrieved from the documents (likely the content from doc_rag and possibly doc_embeddings, depending on similarity scores) that the LLM used to formulate its answer.

## Summary

In this practice session, you built a basic RAG pipeline using Python, LlamaIndex, OpenAI, and FAISS. You ingested text data, created vector embeddings, stored them in a vector store, and used a query engine to retrieve relevant context and generate an informed answer from an LLM. This demonstrates the core workflow of RAG: retrieve relevant information first, then generate the response based on that information. You can adapt this pattern by changing the data source, embedding model, vector store, or LLM to build more sophisticated RAG applications.
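As one example of adapting the pattern, LlamaIndex's global Settings object lets you swap the embedding model or LLM in one place. The following is a minimal sketch, assuming the llama-index-embeddings-huggingface extra is installed; the specific model names are illustrative:

```python
# Sketch of swapping components via LlamaIndex's global Settings
# (assumes `pip install llama-index-embeddings-huggingface`; model names are examples).
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = OpenAI(model="gpt-4")  # swap the generation model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # local embeddings

# Note: a different embedding model usually means a different vector dimension
# (bge-small-en-v1.5 produces 384-dimensional vectors), so the FAISS index must
# be recreated with a matching `d` and the documents re-embedded.
```

Components passed explicitly, as we did with embed_model and llm above, still take precedence over these global defaults.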