As we discussed, Retrieval-Augmented Generation (RAG) enhances LLMs by providing them with relevant information retrieved from external data sources during the generation process. The core component enabling this retrieval is the vector store, sometimes called a vector database. This section focuses on setting up and interacting with basic vector stores using Python, forming a foundational piece of your RAG system.
You might wonder why we can't just use a standard database or search engine. While traditional methods search for exact keyword matches, RAG requires understanding the semantic meaning or contextual similarity between a user's query and the information in your data source. This is where vector embeddings and vector stores come in.
Think of it like organizing books in a library not just by title or author (keywords), but by the underlying topics and ideas they discuss (semantic meaning). A vector store allows your application to find the most relevant "idea vectors" quickly.
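To make this concrete, here is a small sketch of semantic matching. It assumes the sentence-transformers package is installed, and the model name is just an example choice:

# A minimal illustration of semantic matching, assuming sentence-transformers
# is installed; "all-MiniLM-L6-v2" is just an example model name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to bake sourdough bread at home",
    "Training a neural network with PyTorch",
]
query = "deep learning tutorial"

# Embed the documents and the query, then compare with cosine similarity.
doc_vectors = model.encode(docs)
query_vector = model.encode(query)
print(util.cos_sim(query_vector, doc_vectors))
# The PyTorch document scores higher, even though it shares no keywords with the query.

Even though the query shares no words with either document, the embedding of "deep learning tutorial" sits much closer to the PyTorch sentence than to the baking one, which is exactly the behavior a vector store exploits.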
For many development tasks, prototypes, or applications dealing with moderately sized datasets, simple, often self-contained vector stores are sufficient and easy to manage, and the Python ecosystem offers several popular options.
We'll focus on ChromaDB for our examples due to its straightforward API and minimal setup requirements.
Let's walk through the process of installing ChromaDB, adding some data, and performing a similarity search.
First, install the necessary library using pip:
pip install chromadb
You might also need an embedding model provider library, depending on how you generate embeddings (e.g., sentence-transformers, openai, or cohere). We'll assume for now that you have a function or mechanism to convert text into vectors.
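For illustration only, such a helper might look like the sketch below. The model name is an assumption, and the ChromaDB examples that follow use Chroma's built-in default embedding function instead of this helper:

# Hypothetical helper for turning text into vectors, sketched with sentence-transformers.
from sentence_transformers import SentenceTransformer

_embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

def embed_texts(texts):
    """Convert a list of strings into a list of embedding vectors."""
    return _embedding_model.encode(texts).tolist()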
You can start using ChromaDB either in-memory (data is lost when the script ends) or by configuring it to persist data to disk.
import chromadb

# In-memory client (easiest to start)
# client = chromadb.Client()

# Persistent client (saves data to disk in the 'my_chroma_db' directory)
client = chromadb.PersistentClient(path="./my_chroma_db")

# Create or get a collection (like a table in a SQL database).
# Collections require an embedding function or explicit embedding vectors.
# For simplicity here, we'll use Chroma's default SentenceTransformer-based
# embedding function. In practice, you'd often configure a specific one.
# Note: This might download a model on first use.
try:
    collection = client.get_collection(name="my_documents")
    print("Collection 'my_documents' already exists.")
except Exception:
    print("Creating collection 'my_documents'...")
    # If using Chroma's default embedding function, ensure sentence-transformers
    # and torch are installed (`pip install sentence-transformers torch`).
    # Or provide your own embedding function:
    # client.create_collection(name="my_documents", embedding_function=my_embedding_function)
    collection = client.create_collection(name="my_documents")
    print("Collection created.")
Here, PersistentClient tells ChromaDB to save its data in the specified directory. A collection is where you'll store your vectors, the original text (documents), and any associated metadata. Chroma needs to know how to embed text, either through an embedding_function or by having pre-computed vectors added directly. Chroma's default is often sufficient for initial experimentation.
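If you prefer an explicit embedding function over the default, one way to configure it is sketched below; the model name is only an example:

# Sketch: configuring an explicit embedding function for the collection.
from chromadb.utils import embedding_functions

st_embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # example model; any supported model works
)
# collection = client.create_collection(
#     name="my_documents",
#     embedding_function=st_embedder
# )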
Now, let's add some text documents to our collection. ChromaDB will automatically use its configured embedding function (the default in this case) to convert the text into vectors before storing them.
# Data to add
documents = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
    "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
    "The Python programming language is widely used for web development, data science, and artificial intelligence."
]
metadatas = [
    {"source": "wiki-landmark", "topic": "architecture"},
    {"source": "wiki-landmark", "topic": "architecture"},
    {"source": "wiki-biology", "topic": "science"},
    {"source": "wiki-programming", "topic": "technology"}
]
ids = ["doc1", "doc2", "doc3", "doc4"]  # Unique IDs for each document

# Add documents to the collection.
# ChromaDB automatically handles embedding generation if vectors are not provided explicitly.
try:
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Added {len(ids)} documents to the collection.")
except ValueError as e:
    print(f"Error adding documents (perhaps duplicates?): {e}")
    # Handle the case where IDs already exist if running the script multiple times.
    # For simplicity, we just print the error here.
We provide the text content (documents), optional metadatas (useful for filtering later), and unique ids for each entry.
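If you re-run the script, the add call above will complain about the existing IDs. One alternative, sketched below, is the upsert method available in recent ChromaDB versions, which inserts new IDs and overwrites existing ones:

# Sketch: upsert inserts new IDs and overwrites existing ones, so re-running
# the script does not raise duplicate-ID errors.
collection.upsert(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)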
The core purpose of the vector store is retrieval. Let's ask a question and find the most relevant documents.
query_text = "What is a popular language for AI?"
n_results = 2  # Number of most similar documents to retrieve

# Perform the similarity search
results = collection.query(
    query_texts=[query_text],
    n_results=n_results,
    # Optional: include=['documents', 'distances', 'metadatas']  # Specify what to return
)

print(f"\nQuery: {query_text}")
print(f"Found {len(results.get('ids', [[]])[0])} results:")

# Process and display results
retrieved_ids = results.get('ids', [[]])[0]
retrieved_docs = results.get('documents', [[]])[0]
retrieved_metadatas = results.get('metadatas', [[]])[0]
retrieved_distances = results.get('distances', [[]])[0]

for i in range(len(retrieved_ids)):
    print(f"  ID: {retrieved_ids[i]}")
    print(f"  Distance: {retrieved_distances[i]:.4f}")  # Lower distance = more similar
    print(f"  Metadata: {retrieved_metadatas[i]}")
    print(f"  Document: {retrieved_docs[i]}")
    print("-" * 20)
ChromaDB takes the query_text, automatically embeds it using the same process as the stored documents, and performs a nearest-neighbor search in the embedding space (using L2 or cosine distance, depending on the collection's configuration) to find the n_results closest vectors. The results include the IDs, the original documents, metadata, and a distance score (lower typically means more similar). As expected, the query about AI languages should retrieve the document about Python.
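The metadata we stored earlier also lets you constrain the search. The sketch below combines similarity search with a where filter so that only documents tagged with a given topic are considered:

# Sketch: similarity search restricted to documents whose metadata matches a filter.
filtered = collection.query(
    query_texts=["famous structures"],
    n_results=2,
    where={"topic": "architecture"}  # only consider documents tagged with this topic
)
print(filtered.get('documents', [[]])[0])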
Diagram illustrating the flow of data during indexing and querying with a vector store like ChromaDB. Text is converted to embeddings and stored; queries are embedded and compared against stored vectors to find relevant information.
Libraries like LangChain and LlamaIndex often provide higher-level abstractions over vector stores. You typically configure the vector store once (as we did above) and then pass the collection object, or a specialized Retriever object built from it, to the framework.
For example, in LangChain, you might wrap a Chroma collection in a Chroma vector store object and use it as a retriever in a RAG chain:
# Example Snippet (Conceptual - assumes LangChain is installed)
# from langchain_community.vectorstores import Chroma
# from langchain_core.embeddings import Embeddings # Base class
# Assuming 'client' and 'collection_name' are defined as before
# and 'embedding_function' is a LangChain-compatible embedding object
# vector_store = Chroma(
# client=client,
# collection_name="my_documents",
# embedding_function=embedding_function # e.g., OpenAIEmbeddings()
# )
# retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# 'retriever' can now be used in a LangChain RAG chain
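Once wrapped, the retriever is used like any other LangChain retriever. A conceptual usage sketch, assuming a recent LangChain version where retrievers expose the standard invoke method:

# Example usage (Conceptual)
# docs = retriever.invoke("What is a popular language for AI?")
# for doc in docs:
#     print(doc.page_content)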
Similarly, LlamaIndex uses vector stores as part of its VectorStoreIndex to manage the storage and retrieval of node embeddings.
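A conceptual sketch of that pattern, assuming the llama-index package and its Chroma integration are installed (module paths vary between LlamaIndex versions):

# Example Snippet (Conceptual - assumes llama-index and its Chroma integration are installed)
# from llama_index.core import VectorStoreIndex
# from llama_index.vector_stores.chroma import ChromaVectorStore

# vector_store = ChromaVectorStore(chroma_collection=collection)
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# query_engine = index.as_query_engine()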
While excellent for getting started, simple local stores like these trade scalability, concurrency, and operational features (such as replication and access control) for ease of use. For smaller projects or initial development, stores like ChromaDB offer a great balance of simplicity and capability. As your needs grow, you might explore more complex self-hosted options (e.g., Weaviate, Qdrant) or managed cloud services (e.g., Pinecone, Google Vertex AI Matching Engine, Azure AI Search).
With a basic vector store set up and populated, you now have the core retrieval mechanism ready. The next step is to integrate this retriever into a complete RAG pipeline, combining it with an LLM to generate informed responses based on the retrieved context.