With the retriever component ready to fetch relevant information (as discussed in Chapter 2 and prepared in Chapter 3), the next logical step is integrating the "Generation" part of Retrieval-Augmented Generation. This involves connecting a Large Language Model (LLM) that will synthesize the user's query and the retrieved context into a coherent final answer.
The LLM acts as the reasoning engine. It doesn't simply repeat the retrieved text; it uses that text as grounding knowledge to formulate a response that directly addresses the original question while staying anchored in the provided context.
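In practice, this means joining the retrieved chunks into a context block and placing it alongside the question in a single augmented prompt. The helper below is a minimal sketch of that assembly step; the function name and template wording are illustrative placeholders, not a required format.
# Minimal sketch: assembling a query and retrieved chunks into one augmented prompt.
# The function name and template wording are illustrative, not a fixed convention.
def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "Based on the following context, answer the user's query.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n\n"
        "Answer:"
    )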
There are two primary ways to integrate an LLM into your RAG pipeline: calling a hosted model through a provider's API, or running an open-source model locally, for example with Hugging Face transformers or specialized serving frameworks like Ollama or vLLM. Let's look at how to implement these approaches.
Using an API is often the quickest way to get started. The provider manages the model hosting, scaling, and maintenance. Your application sends the augmented prompt (query + context) to the API endpoint and receives the generated text back.
Steps:
1. Choose a provider and model (e.g., gpt-3.5-turbo, claude-3-opus, gemini-pro).
2. Use pip to install the necessary Python client library (e.g., pip install openai, pip install anthropic).
3. Instantiate the client with your API key and send the augmented prompt to the model's generation method (typically named create, complete, or generate).
Example (OpenAI Integration):
# Note: Requires 'openai' library installed and OPENAI_API_KEY environment variable set.
import os

from openai import OpenAI

# 1. Instantiate the client (authenticates using the environment variable)
try:
    client = OpenAI()
    # api_key can also be explicitly passed: OpenAI(api_key="YOUR_API_KEY")
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    # Handle error appropriately (e.g., exit, log, raise)
    exit()

# 2. Prepare the augmented prompt (example structure)
user_query = "What were the main findings of the climate report?"
retrieved_context = """
Document Snippet 1: The report highlights a significant increase in global average temperatures...
Document Snippet 2: Key findings include accelerated sea-level rise and more frequent extreme weather events...
"""

augmented_prompt = f"""
Based on the following context, answer the user's query.
Context:
{retrieved_context}
Query: {user_query}
Answer:
"""

# 3. Make the API call
try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # Or another suitable model
        messages=[
            {"role": "system", "content": "You are a helpful assistant responding based on provided context."},
            {"role": "user", "content": augmented_prompt}
        ],
        temperature=0.7,  # Controls randomness (creativity vs. determinism)
        max_tokens=150    # Limits the length of the generated response
    )

    # 4. Process the response
    if response.choices:
        generated_text = response.choices[0].message.content.strip()
        print("LLM Response:")
        print(generated_text)
    else:
        print("No response generated.")

except Exception as e:
    print(f"Error during OpenAI API call: {e}")
    # Handle API errors (e.g., rate limits, authentication issues)
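The same pattern applies to other providers, with only the client library changing. Below is a rough sketch using Anthropic's Python client; it assumes pip install anthropic, an ANTHROPIC_API_KEY environment variable, and reuses the augmented_prompt built above (the model name is only an example).
# Note: Sketch only. Requires 'anthropic' installed and ANTHROPIC_API_KEY set.
from anthropic import Anthropic

anthropic_client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

anthropic_response = anthropic_client.messages.create(
    model="claude-3-opus-20240229",  # Example model; choose one available to your account
    max_tokens=150,
    system="You are a helpful assistant responding based on provided context.",
    messages=[
        {"role": "user", "content": augmented_prompt}  # Same augmented prompt as above
    ],
)

print(anthropic_response.content[0].text.strip())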
Considerations for APIs: you pay per token, every request adds network latency, providers enforce rate limits, and your query plus the retrieved context leaves your infrastructure, which may matter for data privacy or compliance.
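Rate limits and transient network errors are common enough that it helps to wrap API calls in a simple retry loop with exponential backoff. The sketch below shows one way to do this; the retry count and delays are arbitrary illustrative values.
import time

# Sketch: retry a flaky API call with exponential backoff.
# max_retries and base_delay are arbitrary illustrative values.
def call_with_retries(make_request, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({e}); retrying in {delay:.1f}s...")
            time.sleep(delay)

# Usage with the OpenAI client from the earlier example:
# response = call_with_retries(lambda: client.chat.completions.create(...))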
Running models locally gives you more control over the environment and data privacy, but requires managing the computational resources and model setup. Libraries like Hugging Face's transformers
make loading and running many open-source models relatively straightforward.
Steps:
1. Install transformers and a backend such as PyTorch (torch) or TensorFlow (tensorflow); some models need additional dependencies (pip install transformers torch).
2. Load the model and tokenizer, build the augmented prompt, and produce text with a text-generation pipeline or the model's generate function.
Example (Hugging Face transformers Integration):
# Note: Requires 'transformers' and 'torch' (or 'tensorflow') installed.
# May require significant RAM/VRAM depending on the model.
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch  # Or import tensorflow as tf

# 1. Choose a model (example: a smaller, manageable model)
model_name = "gpt2"  # Replace with a larger/better model if resources allow, e.g., "mistralai/Mistral-7B-Instruct-v0.1"

# 2. Load Model and Tokenizer (downloads weights on first run)
try:
    # Using the pipeline for a simpler interface (handles tokenization/decoding)
    # Pipeline convention: device=0 for the first GPU (if available and configured), -1 for CPU
    device = 0 if torch.cuda.is_available() else -1
    generator_pipeline = pipeline(
        "text-generation",
        model=model_name,
        device=device
    )
    print(f"Loaded model {model_name} onto device: {'GPU' if device == 0 else 'CPU'}")

    # Alternatively, load manually for more control:
    # tokenizer = AutoTokenizer.from_pretrained(model_name)
    # model = AutoModelForCausalLM.from_pretrained(model_name)
    # model.to('cuda' if torch.cuda.is_available() else 'cpu')  # Move model to device
except Exception as e:
    print(f"Error loading model {model_name}: {e}")
    # Handle errors (e.g., model not found, insufficient memory)
    exit()

# 3. Prepare the augmented prompt
user_query = "What is the capital of France?"
retrieved_context = "France is a country in Western Europe. Paris is its capital and largest city."

# Basic prompt template
augmented_prompt = f"""
Context: {retrieved_context}
Question: {user_query}
Answer: """

# 4. Generate text using the pipeline
try:
    # Pipeline handles tokenization, generation, and decoding
    responses = generator_pipeline(
        augmented_prompt,
        max_new_tokens=50,  # Limit the number of tokens generated *after* the prompt
        num_return_sequences=1,
        eos_token_id=generator_pipeline.tokenizer.eos_token_id  # Stop generation at end-of-sequence token
    )
    generated_text = responses[0]['generated_text']

    # The pipeline output usually includes the prompt; we often want only the answer part.
    # Simple approach: take the text after the end of the prompt.
    answer_part = generated_text[len(augmented_prompt):].strip()
    print("\nLLM Response (Answer Part):")
    print(answer_part)

    # --- Manual generation (if not using the pipeline) ---
    # inputs = tokenizer(augmented_prompt, return_tensors="pt").to(model.device)
    # outputs = model.generate(**inputs, max_new_tokens=50)
    # decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # print("\nLLM Response (Manual):")
    # print(decoded_output)
except Exception as e:
    print(f"Error during text generation: {e}")
    # Handle generation errors
Considerations for Local Models: you need enough RAM/VRAM for the chosen model, downloads can be large, smaller models that fit on modest hardware generally produce weaker answers, and you are responsible for serving and keeping models updated; in return you keep data on your own machines and avoid per-token costs.
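If you would rather not manage models in-process, serving frameworks such as Ollama (mentioned earlier) run the model behind a local HTTP endpoint that your application calls much like a hosted API. The sketch below assumes Ollama is installed, running on its default local port, and that the example model has already been pulled (e.g., ollama pull llama3).
# Sketch: calling a locally served model through Ollama's HTTP API.
# Assumes the Ollama server is running on the default port 11434 and the model is pulled.
import requests

payload = {
    "model": "llama3",           # Example model name; use whichever model you have pulled
    "prompt": augmented_prompt,  # Same query + context prompt as before
    "stream": False,             # Return the full response as a single JSON object
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"].strip())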
Frameworks like LangChain and LlamaIndex provide higher-level abstractions that simplify generator integration. You typically configure the LLM you want to use (whether API-based or local) within the framework's objects.
Example (LangChain):
# Note: Example, requires LangChain and provider libraries installed.
# --- Configuration for an API-based LLM ---
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7, openai_api_key="YOUR_API_KEY")
# --- Configuration for a local LLM via Hugging Face ---
# from langchain_community.llms import HuggingFacePipeline
# llm = HuggingFacePipeline.from_model_id(
#     model_id="gpt2",
#     task="text-generation",
#     pipeline_kwargs={"max_new_tokens": 100},
#     device=0  # Use GPU 0 if available
# )
# --- Later in the RAG chain ---
# Assume 'retriever' is configured and 'prompt_template' is defined
# from langchain_core.runnables import RunnablePassthrough
# from langchain_core.output_parsers import StrOutputParser
# rag_chain = (
#     {"context": retriever, "question": RunnablePassthrough()}  # Fetch context based on input question
#     | prompt_template    # Format the prompt
#     | llm                # Pass augmented prompt to the configured LLM
#     | StrOutputParser()  # Parse the LLM output string
# )
# result = rag_chain.invoke("What is the capital of France?")
# print(result)
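The chain above assumes a prompt_template object with context and question variables. One way to define it, sketched here with LangChain's ChatPromptTemplate (the template wording is only an example):
# --- Defining the prompt_template used in the chain above ---
# from langchain_core.prompts import ChatPromptTemplate
# prompt_template = ChatPromptTemplate.from_template(
#     "Based on the following context, answer the user's query.\n\n"
#     "Context:\n{context}\n\n"
#     "Question: {question}\n\n"
#     "Answer:"
# )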
These frameworks handle the boilerplate code for API calls or local model interactions, letting you focus on the pipeline logic. We'll see more of this structure when combining components in the next section.
Integrating the generator is a central step. Whether you opt for the convenience of APIs or the control of local models, the goal remains the same: provide the LLM with both the user's question and the relevant context retrieved from your knowledge base, enabling it to generate an informed and accurate response. Now that we have ways to implement both the retriever and the generator, we are ready to connect them.