Building a Retrieval Augmented Generation (RAG) system requires several essential components: a retriever that fetches relevant documents from a vector store and an LLM that generates text. Combining these elements creates a cohesive application that can take a user's question and produce a well-grounded answer.
LangChain provides helper functions to construct this workflow. The modern approach is to build a retrieval chain, which manages the retrieval of documents and the subsequent call to the language model. This automates the "retrieve-then-read" pattern that defines RAG.
The process is straightforward: the user's query is first passed to the retriever. The retriever fetches the most relevant documents from the vector store. These documents are then formatted into a prompt along with the original query and sent to the LLM, which generates the final answer based on the provided context.
Diagram: the flow within a retrieval chain. The user's query serves two purposes: finding relevant documents and forming part of the final prompt sent to the LLM.
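Under the hood, this flow is just a composition of runnables. As a point of reference, here is a minimal sketch of the same retrieve-then-read pattern wired by hand with the LangChain Expression Language. It assumes a retriever and an llm already exist (both are created in the example below), and format_docs is our own illustrative helper, not a LangChain API:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Our own helper: join the retrieved documents into one context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

manual_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
)

# Fetch documents and pass the raw question through in parallel,
# then fill the prompt and call the model
manual_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | manual_prompt
    | llm
    | StrOutputParser()
)

# answer = manual_rag_chain.invoke("What is retrieval augmented generation?")
The helper functions introduced next package this same composition with sensible defaults and a structured output.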
Creating a retrieval chain involves two steps: first, creating a chain that combines documents into a prompt (the question-answering part), and second, connecting that to the retriever.
We use two factory functions:
create_stuff_documents_chain: This takes a list of documents, formats them into a prompt, and passes them to the LLM. It implements the standard "stuff" strategy.
create_retrieval_chain: This connects the retriever to the document chain, handling the fetching and passing of documents automatically.
Let's see this in action. Assuming you have an llm instance (such as ChatOpenAI) and a retriever created from your vector store, you can build the chain like this:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Assume you have already loaded documents and created a vector store 'db'
# For example:
# db = Chroma.from_documents(split_docs, OpenAIEmbeddings())
# 1. Instantiate the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# 2. Create the retriever from the vector store
retriever = db.as_retriever(search_kwargs={"k": 3})
# 3. Create the prompt template
system_prompt = (
"You are an assistant for question-answering tasks. "
"Use the following pieces of retrieved context to answer "
"the question. If you don't know the answer, say that you "
"don't know. Use three sentences maximum and keep the "
"answer concise."
"\n\n"
"{context}"
)
prompt = ChatPromptTemplate.from_messages(
[
("system", system_prompt),
("human", "{input}"),
]
)
# 4. Create the chains
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
# 5. Ask a question
question = "What are the limitations of LLMs without RAG?"
response = rag_chain.invoke({"input": question})
print(response["answer"])
When you run this code, LangChain executes the entire RAG pipeline. The retriever finds the top 3 documents related to the question, the question_answer_chain inserts them into the {context} placeholder of the prompt, and the llm uses this to answer. The result is an answer grounded in the information from your specific documents.
The example above uses the "stuff" strategy, implemented via create_stuff_documents_chain. This is the most common approach, but depending on your use case and document volume, you might encounter other patterns:
Stuff: This is the most direct approach. It takes all the retrieved documents, inserts them into a prompt template, and sends the entire block of text to the LLM in a single API call.
Map-Reduce: This method is designed for larger numbers of documents. It first runs an initial prompt on each document individually (map step). Then, the outputs from each document are combined and summarized in a separate call (reduce step).
Refine: This approach also processes documents iteratively. It runs a prompt on the first document to generate an initial answer, then loops through the remaining documents, feeding the previous answer and the next document to the LLM to progressively refine the answer. Unlike map_reduce, it builds upon previous answers rather than processing documents independently.
For most standard Q&A tasks, the "stuff" strategy is the best starting point due to its simplicity and performance. You should only consider the other architectures if you consistently encounter context length errors.
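If you do hit those limits, a refine-style loop is straightforward to sketch by hand. The following is a simplified illustration rather than LangChain's built-in refine chain; the prompt wording and the refine_answer helper are our own assumptions, and it reuses the llm and retriever defined earlier:
from langchain_core.prompts import ChatPromptTemplate

initial_prompt = ChatPromptTemplate.from_template(
    "Answer the question based on this context:\n{context}\n\nQuestion: {question}"
)
refine_prompt = ChatPromptTemplate.from_template(
    "Existing answer: {existing_answer}\n\n"
    "Refine the answer using this additional context if it is relevant:\n"
    "{context}\n\nQuestion: {question}"
)

def refine_answer(llm, docs, question):
    # The first document produces an initial answer
    answer = llm.invoke(
        initial_prompt.format_messages(context=docs[0].page_content, question=question)
    ).content
    # Each remaining document gets a chance to improve that answer
    for doc in docs[1:]:
        answer = llm.invoke(
            refine_prompt.format_messages(
                existing_answer=answer, context=doc.page_content, question=question
            )
        ).content
    return answer

# docs = retriever.invoke("What are the limitations of LLMs without RAG?")
# print(refine_answer(llm, docs, "What are the limitations of LLMs without RAG?"))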
A significant benefit of RAG is the ability to trace an answer back to its source material. This is important for building trust and allowing users to verify the information.
The create_retrieval_chain function automatically preserves the retrieved documents and includes them in the output dictionary under the context key. You do not need any extra configuration to access them.
# The input is a dictionary with 'input'
question = "How do vector stores enable semantic search?"
result = rag_chain.invoke({"input": question})
# The output is a dictionary containing the answer and the context (source documents)
print("Answer:")
print(result["answer"])
print("\nSource Documents:")
for doc in result["context"]:
    print(f"- Page Content: {doc.page_content[:150]}...")
    print(f"  Source: {doc.metadata.get('source', 'N/A')}\n")
The output from this chain includes the generated answer and a context list. This structured output is incredibly useful for building user interfaces that show citations or allow users to view the original text.
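As a small illustration of that point, here is a sketch of a helper that renders the chain's output as an answer followed by numbered citations. The format_with_citations name is our own; it is not part of LangChain:
def format_with_citations(result):
    # Combine the generated answer with a numbered list of its sources
    lines = [result["answer"], "", "Sources:"]
    for i, doc in enumerate(result["context"], start=1):
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[{i}] {source}: {doc.page_content[:80]}...")
    return "\n".join(lines)

# print(format_with_citations(result))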
With the ability to construct a complete question-answering chain and inspect its sources, you are now equipped to build powerful, data-aware applications. The next section provides a hands-on practical exercise to solidify these skills by building a Q&A system over a document of your choice.