In this section, we build a complete question-answering application by integrating the components of a Retrieval Augmented Generation (RAG) system: loading a document, splitting it into chunks, indexing those chunks in a vector store, and exposing them through a retriever. The goal is to answer questions about a PDF document by first retrieving the relevant passages and then synthesizing an answer with an LLM.
This hands-on session will walk you through the entire workflow, reinforcing how each component connects to form a data-aware application.
For this exercise, we will use the paper "Attention Is All You Need," which introduced the Transformer architecture. Our application will load this PDF, process it, and answer specific questions about its contents. You can download the PDF from its arXiv page and save it in your project directory as attention-is-all-you-need.pdf.
First, ensure you have the necessary libraries installed. We will use langchain and its community extensions, langchain-openai for the chat model and embeddings, pypdf for loading PDF files, langchain-chroma for the vector store, and tiktoken for text tokenization.
pip install langchain langchain-community langchain-openai langchain-chroma pypdf tiktoken
Next, let's set up our Python script. We'll import the required modules and configure the OpenAI API key. Make sure your API key is set as an environment variable for security.
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
Our first task is to load the PDF. We will use the PyPDFLoader for this. After loading, it's important to split the document into smaller chunks. This ensures that the retrieved context fits within the LLM's context window and improves the relevance of search results. We will use the RecursiveCharacterTextSplitter, a good general-purpose choice.
# 1. Load the document
loader = PyPDFLoader("attention-is-all-you-need.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from the PDF.")
# 2. Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1200,
chunk_overlap=200
)
splits = text_splitter.split_documents(documents)
print(f"Split the document into {len(splits)} chunks.")
Here, chunk_size is set to 1200 characters, and chunk_overlap is set to 200 characters. The overlap helps maintain semantic context between chunks, ensuring that sentences or ideas aren't abruptly cut off.
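If you want to see this overlap, you can print the boundary between two adjacent chunks; the end of one chunk should roughly reappear at the start of the next. This is purely an optional inspection step.
# Optional: inspect the boundary between two adjacent chunks to see the overlap
print(splits[0].page_content[-200:])
print("--- next chunk ---")
print(splits[1].page_content[:200])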
With our text chunks ready, we need to convert them into numerical vectors (embeddings) so they can be indexed for semantic search. We'll use OpenAIEmbeddings for this. These embeddings are then stored in a Chroma vector store.
The following code initializes the embedding model and creates the vector store from our document splits in a single step.
# 3. Create embeddings and vector store
embedding_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
documents=splits,
embedding=embedding_model
)
The Chroma.from_documents method handles the process of creating an embedding for each text chunk and indexing it in the vector database. Now, our document's content is stored and ready for efficient retrieval.
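Before assembling the full chain, it can be useful to sanity-check the index with a raw similarity search. This optional snippet assumes the page metadata that PyPDFLoader attaches to each chunk.
# Optional sanity check: query the vector store directly
hits = vectorstore.similarity_search("What is multi-head attention?", k=3)
for doc in hits:
    print(doc.metadata.get("page"), "|", doc.page_content[:100])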
Now we assemble the final chain. In modern LangChain, we combine a "document chain" (which handles the answer synthesis) with a "retrieval chain" (which handles fetching data).
First, we create a retriever from our vector store. A retriever is an interface that, given a query, returns relevant documents.
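As a quick illustration (separate from the main pipeline below, which uses the default settings), a retriever can be created and queried on its own; invoke takes a query string and returns a list of Document objects.
# Illustration only: create a retriever and query it directly
example_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = example_retriever.invoke("What is multi-head attention?")
print(f"Retrieved {len(docs)} chunks.")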
Next, we define a prompt template instructing the model to use the provided context to answer the question. Finally, we build the chains.
# 4. Initialize the LLM and the Retrieval Chain
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Define a prompt template
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}""")
# Create the document chain (synthesizes the answer)
document_chain = create_stuff_documents_chain(llm, prompt)
# Create the retrieval chain (fetches documents and passes them to the document chain)
retriever = vectorstore.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
We use the create_stuff_documents_chain, which takes the retrieved documents, combines them into a single context, and passes that context along with the user's question to the LLM.
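To make the "stuffing" behavior concrete, here is an optional sketch that invokes the document chain on its own with a hand-written Document; the retrieval chain performs this step automatically with whatever the retriever returns.
# Optional: run the document chain directly with explicit context
from langchain_core.documents import Document
demo_answer = document_chain.invoke({
    "input": "What does 'stuffing' mean in this context?",
    "context": [Document(page_content="Stuffing means concatenating all retrieved chunks into a single prompt.")],
})
print(demo_answer)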
The complete RAG workflow we have just built consists of two main stages. The ingestion stage processes and indexes documents into a vector store. The query stage retrieves the relevant documents for a user query and uses them to generate the final answer.
Our system is now ready. We can invoke the retrieval_chain with a dictionary containing our input. Let's ask a few questions directly related to the content of the paper.
# 5. Ask questions
query1 = "What is the primary architecture proposed in the 'Attention Is All You Need' paper?"
response1 = retrieval_chain.invoke({"input": query1})
print("Query 1:", query1)
print("Answer 1:", response1['answer'])
print("\n" + "="*50 + "\n")
query2 = "Describe the two main sub-layers in the encoder and decoder stacks."
response2 = retrieval_chain.invoke({"input": query2})
print("Query 2:", query2)
print("Answer 2:", response2['answer'])
The output will be generated by the LLM based on the retrieved text chunks from the PDF. It should look similar to this:
Query 1: What is the primary architecture proposed in the 'Attention Is All You Need' paper?
Answer 1: The primary architecture proposed in the 'Attention Is All You Need' paper is the Transformer, a model architecture that avoids recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.
==================================================
Query 2: Describe the two main sub-layers in the encoder and decoder stacks.
Answer 2: In the encoder and decoder stacks, each layer has two main sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization.
The answers are accurate and directly derived from the source document, demonstrating the effectiveness of the RAG approach. The model isn't just relying on its pre-trained knowledge; it's using the provided context to deliver a specific and factual response.
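You can check this grounding yourself: alongside the answer, the retrieval chain returns the retrieved chunks under the context key, so you can print exactly which passages the model saw.
# Inspect the chunks that were retrieved for the first query
for doc in response1["context"]:
    print(f"Page {doc.metadata.get('page')}: {doc.page_content[:100]}...")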
In this practical exercise, you successfully built a complete Retrieval Augmented Generation system. You learned how to chain together document loaders, text splitters, embedding models, vector stores, and LLMs to create a powerful Q&A application for your private documents.
This pattern is one of the most common and effective applications of LLMs today. We encourage you to experiment further by:
- Adjusting chunk_size and chunk_overlap to see how they affect performance.
- Trying other document-combination strategies, such as map_reduce or refine, for very large documents.
With this foundation, you are now equipped to build sophisticated applications that can reason about and interact with your own data sources.