In this section, we build a complete question-answering application by integrating the components of a Retrieval Augmented Generation (RAG) system: loading a document, splitting it into chunks, indexing those chunks in a vector store, and exposing them through a retriever. The goal is to answer questions about a PDF document by first retrieving the relevant passages and then synthesizing an answer with an LLM.
This hands-on session will walk you through the entire workflow, reinforcing how each component connects to form a data-aware application.
For this exercise, we will use the paper "Attention Is All You Need," which introduced the Transformer architecture. Our application will load this PDF, process it, and answer specific questions about its contents. You can download the PDF from its arXiv page and save it in your project directory as attention-is-all-you-need.pdf.
First, ensure you have the necessary libraries installed. We will use langchain and its community extensions, langchain-openai for the chat model and embeddings, pypdf for loading PDF files, langchain-chroma for the vector store, and tiktoken for text tokenization.
pip install langchain langchain-community langchain-openai langchain-chroma pypdf tiktoken
Next, let's set up our Python script. We'll import the required modules and configure the OpenAI API key. Make sure your API key is set as an environment variable for security.
import os
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
Our first task is to load the PDF. We will use the PyPDFLoader for this. After loading, it's important to split the document into smaller chunks. This ensures that the retrieved context fits within the LLM's context window and improves the relevance of search results. We will use the RecursiveCharacterTextSplitter, a good general-purpose choice.
# 1. Load the document
loader = PyPDFLoader("attention-is-all-you-need.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from the PDF.")
# 2. Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1200,
chunk_overlap=200
)
splits = text_splitter.split_documents(documents)
print(f"Split the document into {len(splits)} chunks.")
Here, chunk_size is set to 1200 characters, and chunk_overlap is set to 200 characters. The overlap helps maintain semantic context between chunks, ensuring that sentences or ideas aren't abruptly cut off.
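If you want to see this overlap, you can print the boundary between two adjacent chunks; the end of one chunk should roughly reappear at the start of the next. This is purely an optional inspection step.
# Optional: inspect the boundary between two adjacent chunks to see the overlap
print(splits[0].page_content[-200:])
print("--- next chunk ---")
print(splits[1].page_content[:200])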
With our text chunks ready, we need to convert them into numerical vectors (embeddings) so they can be indexed for semantic search. We'll use OpenAIEmbeddings for this. These embeddings are then stored in a Chroma vector store.
The following code initializes the embedding model and creates the vector store from our document splits in a single step.
# 3. Create embeddings and vector store
embedding_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
documents=splits,
embedding=embedding_model
)
The Chroma.from_documents method handles the process of creating an embedding for each text chunk and indexing it in the vector database. Now, our document's content is stored and ready for efficient retrieval.
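Before assembling the full chain, it can be useful to sanity-check the index with a raw similarity search. This optional snippet assumes the page metadata that PyPDFLoader attaches to each chunk.
# Optional sanity check: query the vector store directly
hits = vectorstore.similarity_search("What is multi-head attention?", k=3)
for doc in hits:
    print(doc.metadata.get("page"), "|", doc.page_content[:100])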
Now we assemble the final chain. In modern LangChain, we combine a "document chain" (which handles the answer synthesis) with a "retrieval chain" (which handles fetching data).
First, we create a retriever from our vector store. A retriever is an interface that, given a query, returns relevant documents.
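As a quick illustration (separate from the main pipeline below, which uses the default settings), a retriever can be created and queried on its own; invoke takes a query string and returns a list of Document objects.
# Illustration only: create a retriever and query it directly
example_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = example_retriever.invoke("What is multi-head attention?")
print(f"Retrieved {len(docs)} chunks.")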
Next, we define a prompt template instructing the model to use the provided context to answer the question. Finally, we build the chains.
# 4. Initialize the LLM and the Retrieval Chain
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# Define a prompt template
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}""")
# Create the document chain (synthesizes the answer)
document_chain = create_stuff_documents_chain(llm, prompt)
# Create the retrieval chain (fetches documents and passes them to the document chain)
retriever = vectorstore.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
We use the create_stuff_documents_chain, which takes the retrieved documents, combines them into a single context, and passes that context along with the user's question to the LLM.
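To make the "stuffing" behavior concrete, here is an optional sketch that invokes the document chain on its own with a hand-written Document; the retrieval chain performs this step automatically with whatever the retriever returns.
# Optional: run the document chain directly with explicit context
from langchain_core.documents import Document
demo_answer = document_chain.invoke({
    "input": "What does 'stuffing' mean in this context?",
    "context": [Document(page_content="Stuffing means concatenating all retrieved chunks into a single prompt.")],
})
print(demo_answer)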
The complete RAG workflow we have just built consists of two main stages. The ingestion stage processes and indexes documents into a vector store. The query stage retrieves the relevant documents for a user query and uses them to generate the final answer.
Our system is now ready. We can invoke the retrieval_chain with a dictionary containing our input. Let's ask a few questions directly related to the content of the paper.
# 5. Ask questions
query1 = "What is the primary architecture proposed in the 'Attention Is All You Need' paper?"
response1 = retrieval_chain.invoke({"input": query1})
print("Query 1:", query1)
print("Answer 1:", response1['answer'])
print("\n" + "="*50 + "\n")
query2 = "Describe the two main sub-layers in the encoder and decoder stacks."
response2 = retrieval_chain.invoke({"input": query2})
print("Query 2:", query2)
print("Answer 2:", response2['answer'])
The output will be generated by the LLM based on the retrieved text chunks from the PDF. It should look similar to this:
Query 1: What is the primary architecture proposed in the 'Attention Is All You Need' paper?
Answer 1: The primary architecture proposed in the 'Attention Is All You Need' paper is the Transformer, a model architecture that avoids recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.
==================================================
Query 2: Describe the two main sub-layers in the encoder and decoder stacks.
Answer 2: In the encoder and decoder stacks, each layer has two main sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization.
The answers are accurate and directly derived from the source document, demonstrating the effectiveness of the RAG approach. The model isn't just relying on its pre-trained knowledge; it's using the provided context to deliver a specific and factual response.
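You can check this grounding yourself: alongside the answer, the retrieval chain returns the retrieved chunks under the context key, so you can print exactly which passages the model saw.
# Inspect the chunks that were retrieved for the first query
for doc in response1["context"]:
    print(f"Page {doc.metadata.get('page')}: {doc.page_content[:100]}...")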
In this practical exercise, you successfully built a complete Retrieval Augmented Generation system. You learned how to chain together document loaders, text splitters, embedding models, vector stores, and LLMs to create a powerful Q&A application for your private documents.
This pattern is one of the most common and effective applications of LLMs today. We encourage you to experiment further by:
- Adjusting chunk_size and chunk_overlap to see how they affect performance.
- Trying other document-combination strategies, such as map_reduce or refine, for very large documents.
With this foundation, you are now equipped to build sophisticated applications that can reason about and interact with your own data sources.