Now that you understand the core ideas behind Retrieval-Augmented Generation and the roles of components like vector stores and embeddings, let's walk through the practical steps of building a RAG pipeline. This process connects your external data source to the Language Model, enabling grounded and contextually relevant responses.
The fundamental flow of a RAG system involves retrieving relevant information based on a user's query and then providing that information, along with the original query, to the LLM to generate a final answer.
A typical RAG pipeline flow: Query -> Retrieve -> Augment -> Generate.
Let's break down the construction into distinct stages, often implemented using libraries like LangChain or LlamaIndex.
First, you need to ingest the external knowledge source. This could be text files, PDFs, web pages, database entries, or other formats. Data loaders provided by LlamaIndex or LangChain's document loaders simplify this process.
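Loaders exist for many other formats as well. As a rough sketch (the file name and URL below are placeholders, and PyPDFLoader and WebBaseLoader assume the pypdf and beautifulsoup4 packages are installed):
# Example loading a PDF and a web page (illustrative sources)
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
pdf_docs = PyPDFLoader("report.pdf").load()  # Returns one Document per page
web_docs = WebBaseLoader("https://example.com/docs").load()  # Page text as Document(s)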
Once loaded, the raw text usually needs to be split into smaller, manageable chunks. This is important because embedding models and LLM context windows can only handle a limited amount of text at once, and smaller, focused chunks tend to produce more precise similarity matches during retrieval.
Common strategies involve splitting by paragraphs, by sentences, or using recursive character splitting that tries to maintain semantic coherence.
# Example using LangChain's text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
# Assume 'my_document.txt' contains your data
loader = TextLoader('my_document.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Max characters per chunk
    chunk_overlap=50   # Overlap between chunks
)
text_chunks = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} document(s).")
print(f"Split into {len(text_chunks)} chunks.")
# Output might be:
# Loaded 1 document(s).
# Split into 42 chunks.
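If your source text has clear paragraph boundaries, a simpler splitter keyed on blank lines may work just as well; the following is a sketch of one alternative:
# Example (alternative): paragraph-based splitting
from langchain.text_splitter import CharacterTextSplitter
paragraph_splitter = CharacterTextSplitter(
    separator="\n\n",  # Split on blank lines (paragraph boundaries)
    chunk_size=500,
    chunk_overlap=50
)
paragraph_chunks = paragraph_splitter.split_documents(documents)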
Each text chunk needs to be converted into a numerical vector representation, known as an embedding. This vector captures the semantic meaning of the text. Models like OpenAI's text-embedding-ada-002 or open-source models from Hugging Face's sentence-transformers library are commonly used. The choice of embedding model can significantly impact retrieval quality.
# Example using OpenAI embeddings via LangChain
from langchain_openai import OpenAIEmbeddings
# Ensure OPENAI_API_KEY is set in your environment
embeddings_model = OpenAIEmbeddings()
# You would typically embed text_chunks here,
# but vector stores often handle this implicitly during indexing.
# Example embedding a single text:
# sample_embedding = embeddings_model.embed_query("This is a sample text chunk.")
# print(f"Embedding dimension: {len(sample_embedding)}")
# Output might be: Embedding dimension: 1536
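If you prefer to avoid an external API, an open-source sentence-transformers model can be used through LangChain's community integration instead; this sketch assumes the sentence-transformers package is installed:
# Example (alternative): open-source embeddings via sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # Small, general-purpose model
)
# sample = hf_embeddings.embed_query("This is a sample text chunk.")
# print(len(sample))  # 384 dimensions for this model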
The generated embeddings, along with their corresponding text chunks, are stored in a specialized database called a vector store. This store is optimized for efficient similarity searches. When a query comes in, the vector store can quickly find the text chunks whose embeddings are closest (most similar) to the query's embedding.
Popular choices include Chroma, FAISS (often used locally), Pinecone, Weaviate, and many others.
# Example setting up Chroma vector store with LangChain
from langchain_community.vectorstores import Chroma
# This combines embedding and indexing for the prepared text_chunks
vector_store = Chroma.from_documents(
    documents=text_chunks,
    embedding=embeddings_model,
    persist_directory="./chroma_db"  # Optional: Save to disk
)
print("Vector store created and populated.")
LlamaIndex provides a similarly streamlined process for loading, splitting, embedding, and indexing data into various vector stores through its VectorStoreIndex abstraction.
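For comparison, a rough LlamaIndex equivalent (assuming your documents sit in a local ./data directory and using the default in-memory storage) might look like this:
# Example: loading and indexing with LlamaIndex
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
li_documents = SimpleDirectoryReader("./data").load_data()
li_index = VectorStoreIndex.from_documents(li_documents)  # Splits, embeds, and indexes
li_query_engine = li_index.as_query_engine()  # Ready for retrieval-augmented queries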
The retriever is the component responsible for fetching relevant information from the vector store based on the user's query. It takes the query, generates its embedding using the same model used for indexing, and performs a similarity search (e.g., cosine similarity or dot product) against the embeddings in the vector store. It typically returns the top k most similar chunks.
# Example creating a retriever from the LangChain vector store
retriever = vector_store.as_retriever(search_kwargs={"k": 3}) # Retrieve top 3 chunks
# Example retrieving documents for a query
query = "What are the best practices for LLM deployment?"
retrieved_docs = retriever.invoke(query)
print(f"Retrieved {len(retrieved_docs)} documents for the query.")
# for doc in retrieved_docs:
# print(f" - {doc.page_content[:100]}...") # Print start of retrieved content
The core idea of RAG is to augment the prompt sent to the LLM. Instead of just sending the user's raw query, you construct a new prompt that includes the content of the documents retrieved in the previous step. This provides the necessary context for the LLM.
A common pattern uses a prompt template:
from langchain.prompts import PromptTemplate
template = """
Answer the following question based only on the provided context:
Context:
{context}
Question:
{question}
Answer:
"""
rag_prompt = PromptTemplate.from_template(template)
The context placeholder will be filled with the content from retrieved_docs, and question will be the original user query.
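To see the augmentation concretely, you can fill the template by hand using the objects defined above:
# Example: building the augmented prompt manually
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)
augmented_prompt = rag_prompt.format(context=context_text, question=query)
# print(augmented_prompt)  # The full prompt that will be sent to the LLM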
Finally, the augmented prompt is sent to the LLM (e.g., GPT-4, Claude 3, Llama 3) to generate the final answer. The LLM uses both the original question and the provided context to formulate its response, grounding it in the retrieved information.
Frameworks like LangChain provide abstractions (such as the RetrievalQA chain or the LangChain Expression Language, LCEL) to string these steps together neatly.
# Example using LangChain's RetrievalQA chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # 'stuff' puts all retrieved docs directly into the context
    retriever=retriever,
    chain_type_kwargs={"prompt": rag_prompt}  # Use our custom prompt
)
# Run the chain with the user query
query = "What are the key challenges in testing LLM systems?"
response = qa_chain.invoke({"query": query})
print("\nLLM Response:")
print(response['result'])
The stuff chain type is simple but can fail if the retrieved documents exceed the LLM's context window limit. Other chain types, like map_reduce or refine, handle larger amounts of context differently. LlamaIndex query engines offer similar functionality.
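As an alternative to RetrievalQA, the same pipeline can be composed with LCEL, reusing the retriever, prompt, and LLM defined above; a minimal sketch:
# Example: composing the RAG pipeline with LCEL
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    # Join retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What are the key challenges in testing LLM systems?"))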
This sequence of loading, splitting, embedding, indexing, retrieving, augmenting, and generating forms the backbone of a RAG pipeline. While this outlines the basic structure, building effective RAG systems often involves careful tuning of each step, including the chunking strategy, the choice of embedding model, retrieval parameters, and the prompt itself, which we will explore further.