Before we can assemble the components of our RAG system, we need to prepare our development workspace. This involves setting up an isolated Python environment and installing the necessary libraries that provide the building blocks for retrieval, generation, and orchestration. Ensuring you have the correct tools installed is the first step towards building a functional pipeline.
It's standard practice in Python development to use virtual environments to manage project dependencies. This prevents conflicts between libraries required by different projects. If you're not already familiar with venv, it's Python's built-in tool for creating lightweight virtual environments.
Open your terminal or command prompt and navigate to your project directory. Then, create a virtual environment (we'll name it rag-env here, but you can choose any name):
python -m venv rag-env
Next, activate the environment. The activation command differs depending on your operating system.

On macOS and Linux:

source rag-env/bin/activate

On Windows (Command Prompt):

rag-env\Scripts\activate.bat

On Windows (PowerShell):

rag-env\Scripts\Activate.ps1
Once activated, your terminal prompt will usually change to indicate that you are working inside the rag-env environment. All packages installed from now on will be local to this environment.
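If you want to confirm programmatically that the environment is active, you can compare the interpreter's sys.prefix (the environment currently in use) with sys.base_prefix (the base installation). This is a minimal, library-independent check:

```python
import sys

# In an activated virtual environment, sys.prefix points at the venv
# directory, while sys.base_prefix still points at the base interpreter.
in_venv = sys.prefix != sys.base_prefix

print(f"Active interpreter prefix: {sys.prefix}")
print(f"Inside a virtual environment: {in_venv}")
```

If the two prefixes match, the virtual environment is not active and packages would be installed globally.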
With the virtual environment active, we can install the Python packages needed for our basic RAG pipeline. We'll use pip, the Python package installer.
For this introductory course, we will use libraries that are commonly found in the RAG ecosystem. These include:

- sentence-transformers: a popular choice for accessing high-quality open embedding models that can run locally. Alternatively, embedding APIs from providers like OpenAI can be used.
- ChromaDB: a simple, local vector store suitable for getting started. Other options like FAISS (from Facebook AI) are also widely used.
- pypdf: needed for loading PDF documents.
- python-dotenv: for managing API keys securely, and tiktoken for counting tokens specifically for OpenAI models.

Let's install these using a single pip command:
pip install langchain langchain-openai langchain-community sentence-transformers chromadb python-dotenv tiktoken pypdf
Let's briefly break down what we've installed:
- langchain: The core library for the LangChain framework.
- langchain-openai: Provides specific integrations for OpenAI models (LLMs and embeddings).
- langchain-community: Contains many community-maintained integrations, including document loaders and vector stores like ChromaDB.
- sentence-transformers: Used for generating text embeddings using models from the Hugging Face Hub.
- chromadb: The client library for the Chroma vector database.
- python-dotenv: Utility to load environment variables from a .env file (useful for API keys).
- tiktoken: OpenAI's tokenizer, helpful for counting tokens to manage context windows.
- pypdf: A library for reading text content from PDF files.

Depending on your specific needs, or if you choose alternative components (e.g., a different LLM provider or vector store), you might need to install additional packages. The documentation for LangChain or your chosen libraries will specify the required installations.
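Once everything installs cleanly, it's often worth recording the exact versions so the environment can be reproduced later. One common approach (a convention, not a requirement of any of these libraries) is a requirements.txt file:

```shell
# Record the exact versions installed in the active environment
pip freeze > requirements.txt

# Later, recreate the same set of packages elsewhere
pip install -r requirements.txt
```

Pinning versions this way helps avoid surprises when a library releases a breaking change between the time you write your pipeline and the time someone else runs it.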
Many LLMs and embedding services (like OpenAI's) require API keys for authentication. It's important not to hardcode these keys directly in your scripts. The standard approach is to use environment variables.
The python-dotenv library helps manage this. Create a file named .env in the root directory of your project (alongside your Python scripts), and add your API keys to it:
# .env file
OPENAI_API_KEY="sk-YourSecretOpenAIKeyGoesHere"
# Add other keys if needed, e.g.:
# HUGGINGFACEHUB_API_TOKEN="hf_YourHuggingFaceToken"
Make sure to replace "sk-YourSecretOpenAIKeyGoesHere" with your actual key. Important: add .env to your .gitignore file to prevent accidentally committing your secret keys to version control.
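For example, a minimal .gitignore for this project might look like the following (the rag-env/ entry assumes you used that environment name when creating the venv):

```
# .gitignore
.env
rag-env/
__pycache__/
```

This keeps both your secrets and your local environment directory out of the repository.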
In your Python code, you can load these variables early in your script:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Access the API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# You can now use this variable when configuring LLM clients
# Example:
# llm = OpenAI(api_key=openai_api_key)

if openai_api_key is None:
    print("Warning: OPENAI_API_KEY not found in environment variables.")
    # Handle the case where the key is missing, perhaps by exiting or using a fallback.
Loading the variables at the start makes them available throughout your application via os.getenv().
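If you prefer to fail fast rather than print a warning, a small helper can enforce that a required variable is present before the pipeline starts. This is a sketch; require_env is our own hypothetical helper, not part of python-dotenv:

```python
import os

def require_env(name: str) -> str:
    # Return the variable's value, or raise immediately with a clear message.
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Demonstration with a variable we set ourselves:
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))
```

Raising an exception at startup is usually preferable to discovering a missing key deep inside a retrieval or generation call.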
You can quickly verify that the core libraries are installed correctly by trying to import them in a Python interpreter or script:
import langchain
import langchain_openai
import langchain_community
import sentence_transformers
import chromadb
import tiktoken
import pypdf # Note: pypdf is the package name, used like this in code
print("Core RAG libraries imported successfully!")
If this code runs without a ModuleNotFoundError, your environment is likely set up correctly with the essential packages.
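Beyond bare imports, you can also report the installed versions using the standard-library importlib.metadata module, which is handy when matching documentation or filing bug reports. The package names below mirror the pip install command above:

```python
from importlib.metadata import version, PackageNotFoundError

# Distribution names as used by pip (note: these can differ from import names,
# e.g. the sentence-transformers distribution is imported as sentence_transformers)
packages = ["langchain", "chromadb", "sentence-transformers", "pypdf"]

for pkg in packages:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```

A package that prints NOT INSTALLED simply was not found in the active environment; double-check that the virtual environment is activated before re-running pip.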
With the environment prepared and necessary libraries installed, we are now ready to move on to the first functional part of our pipeline: implementing the retriever component, which involves loading data, generating embeddings, and setting up the vector store for similarity search.
© 2025 ApX Machine Learning