A Retrieval Augmented Generation system operates through a structured, multi-stage pipeline designed to ground a Large Language Model's responses in external data. This process can be separated into two primary phases: an offline indexing phase where the knowledge base is prepared, and an online retrieval and generation phase that serves user queries in real time. Understanding this two-part architecture is fundamental to building effective RAG applications.
The following diagram illustrates the complete workflow, showing how data is processed during indexing and how a query is handled to produce an answer.
The RAG workflow consists of an offline Indexing Pipeline to prepare data and an online Query Pipeline to generate answers. The Retriever searches the Vector Store to find relevant context, which is formatted into a prompt for the LLM.
Let's examine each phase in more detail.
The goal of the indexing pipeline is to take a collection of unstructured documents and convert them into a structured, searchable knowledge base. This is a preparatory step that is typically run once or periodically as new data becomes available. It involves a sequence of data processing operations.
The process begins with your raw data, which can exist in various formats like PDF files, web pages, text files, or entries in a database. A DocumentLoader is used to ingest this data into a standardized format that LangChain can work with. Each loaded source is represented as a Document object, which contains the text content and associated metadata.
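As a concrete illustration, here is a minimal loading sketch using LangChain's TextLoader. The file name notes.txt is a placeholder, and the exact import path can vary slightly between LangChain versions.

```python
# Minimal loading sketch. Assumes the langchain-community package is installed
# and that a local file named "notes.txt" exists (placeholder name).
from langchain_community.document_loaders import TextLoader

loader = TextLoader("notes.txt")
documents = loader.load()  # returns a list of Document objects

# Each Document carries the raw text plus metadata such as the source path.
print(documents[0].page_content[:200])
print(documents[0].metadata)
```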
Large documents often exceed the context window of an LLM and are inefficient for targeted retrieval. If you provide a 100-page document as context for a very specific question, the model may struggle to locate the relevant information. To address this, TextSplitters are used to break large Document objects into smaller, more manageable chunks. The art of splitting is to create chunks that are small enough to be processed efficiently but large enough to retain their semantic meaning.
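To make this concrete, the sketch below splits the loaded documents with RecursiveCharacterTextSplitter. The chunk_size and chunk_overlap values are illustrative starting points, not tuned recommendations.

```python
# Splitting sketch. Assumes the langchain-text-splitters package is installed;
# in older versions the same class lives under langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk (illustrative value)
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
)
chunks = splitter.split_documents(documents)
print(f"Split {len(documents)} document(s) into {len(chunks)} chunks")
```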
Once the text is chunked, it needs to be converted into a format that a machine can use for comparison. This is where embeddings come in. An embedding model, which is a separate neural network, converts each text chunk into a high-dimensional numerical vector. These vectors are designed so that chunks with similar semantic meaning are located close to each other in the vector space. For example, the vector for "how to make a pizza" will be closer to the vector for "pizza dough recipe" than to the vector for "stock market trends."
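The short sketch below embeds the example phrases and compares them with cosine similarity. It assumes the langchain-openai integration and a configured API key; any embedding model supported by LangChain would work the same way.

```python
# Embedding sketch. Assumes the langchain-openai package is installed and
# the OPENAI_API_KEY environment variable is set.
import math
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embeddings.embed_query("how to make a pizza")
v2 = embeddings.embed_query("pizza dough recipe")
v3 = embeddings.embed_query("stock market trends")

# Semantically related phrases should score higher than unrelated ones.
print(cosine_similarity(v1, v2))  # expected to be relatively high
print(cosine_similarity(v1, v3))  # expected to be lower
```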
The final step in indexing is to store the text chunks and their corresponding embedding vectors in a specialized database called a VectorStore. This type of database is optimized for performing extremely fast similarity searches on high-dimensional vectors. Given an input vector, the vector store can efficiently find the vectors in its index that are most similar, typically using algorithms like Approximate Nearest Neighbor (ANN). Popular vector stores include FAISS, Chroma, and Pinecone.
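Continuing the sketch, the chunks and embedding model from the earlier steps can be indexed into a FAISS store in a few lines. This assumes the faiss-cpu package is installed and reuses the chunks and embeddings objects defined above.

```python
# Indexing sketch. Assumes faiss-cpu and langchain-community are installed,
# and reuses the `chunks` and `embeddings` objects from the previous steps.
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(chunks, embeddings)

# Persist the index locally so the indexing pipeline only has to run
# when the underlying documents change.
vector_store.save_local("faiss_index")
```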
This pipeline activates when a user submits a query. It uses the pre-built index to fetch relevant information and generate an accurate answer.
The user's query, which is a string of text, is passed through the same embedding model that was used during the indexing phase. This converts the query into a vector, v_q, placing it within the same vector space as the document chunk vectors, v_d. This step is significant because it allows us to compare the query's meaning directly against the meaning of the document chunks.
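As a quick illustration, embedding the query is a single call on the same embeddings object used during indexing; the query string here is just an example.

```python
# Query embedding sketch, reusing the `embeddings` object from indexing.
query = "How do I prepare pizza dough?"  # example query
v_q = embeddings.embed_query(query)
print(len(v_q))  # dimensionality of the shared vector space
```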
The query vector v_q is sent to the VectorStore. The store performs a similarity search, calculating the distance between the query vector and all the document chunk vectors in the index. The "top-k" chunks with the highest similarity scores (i.e., the smallest distance in vector space) are returned. This component is managed by a Retriever, which is configured to fetch the most relevant documents for a given query.
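In LangChain, this search is usually wrapped in a retriever obtained from the vector store. The sketch below reuses the vector_store object from the indexing sketch and requests the top 3 chunks; the value of k is an illustrative choice.

```python
# Retrieval sketch. `as_retriever` wraps the vector store's similarity search;
# recent LangChain versions invoke retrievers with `.invoke(...)`, while older
# versions use `.get_relevant_documents(...)`.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
relevant_chunks = retriever.invoke(query)

for doc in relevant_chunks:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```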
The retrieved text chunks serve as the "augmented" context. They are passed to a PromptTemplate component, which combines them with the original user query into a single formatted string. The prompt typically looks something like this:
"Using the following context, please answer the question.
Context: [Retrieved Chunk 1 Text] [Retrieved Chunk 2 Text] ...
Question: [Original User Query]"
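In LangChain, a template like this can be written as a PromptTemplate. The sketch below is one way to express it, with the variable names context and question chosen here for illustration.

```python
# Prompt construction sketch. The template text mirrors the example above;
# `context` and `question` are illustrative variable names.
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "Using the following context, please answer the question.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

formatted = prompt.format(
    context="\n\n".join(doc.page_content for doc in relevant_chunks),
    question=query,
)
print(formatted[:300])
```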
The LLM then synthesizes an answer based only on the information provided in the prompt. This constrains the model to use the factual, up-to-date data from your documents rather than relying solely on its internal, pre-trained knowledge. The result is a response that is grounded, accurate, and specific to your dataset.
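Putting the pieces together, the sketch below passes the formatted prompt to a chat model and reads back the grounded answer. It assumes the langchain-openai integration and a configured API key; the model name is only an example.

```python
# Generation sketch. Assumes langchain-openai is installed and an API key is
# configured; the model name is an example, not a recommendation.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The model sees only the retrieved context plus the question,
# which keeps the answer grounded in the indexed documents.
response = llm.invoke(formatted)
print(response.content)
```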
By breaking the process into these distinct stages, a RAG system provides an effective and modular way to enhance LLMs with external knowledge. In the following sections, we will explore the practical implementation of each component in this architecture using LangChain.