As we discussed, Retrieval-Augmented Generation (RAG) enhances LLMs by providing them with relevant information retrieved from external data sources during the generation process. The core component enabling this retrieval is the vector store, sometimes called a vector database. This section focuses on setting up and interacting with basic vector stores using Python, forming a foundational piece of your RAG system.
You might wonder why we can't just use a standard database or search engine. While traditional methods search for exact keyword matches, RAG requires understanding the semantic meaning or contextual similarity between a user's query and the information in your data source. This is where vector embeddings and vector stores come in.
Think of it like organizing books in a library not just by title or author (keywords), but by the underlying topics and ideas they discuss (semantic meaning). A vector store allows your application to find the most relevant "idea vectors" quickly.
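To make this concrete, here is a small sketch of semantic matching. It assumes the sentence-transformers package is installed, and the model name is just an example choice:

# A minimal illustration of semantic matching, assuming sentence-transformers
# is installed; "all-MiniLM-L6-v2" is just an example model name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to bake sourdough bread at home",
    "Training a neural network with PyTorch",
]
query = "deep learning tutorial"

# Embed the documents and the query, then compare with cosine similarity.
doc_vectors = model.encode(docs)
query_vector = model.encode(query)
print(util.cos_sim(query_vector, doc_vectors))
# The PyTorch document scores higher, even though it shares no keywords with the query.

Even though the query shares no words with either document, the embedding of "deep learning tutorial" sits much closer to the PyTorch sentence than to the baking one, which is exactly the behavior a vector store exploits.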
For many development tasks, prototypes, or applications dealing with moderately sized datasets, simple, often self-contained vector stores are sufficient and easy to manage, and the Python ecosystem offers several popular options.
We'll focus on ChromaDB for our examples due to its straightforward API and minimal setup requirements.
Let's walk through the process of installing ChromaDB, adding some data, and performing a similarity search.
First, install the necessary library using pip:
pip install chromadb
You might also need an embedding model provider library, depending on how you generate embeddings (e.g., sentence-transformers, openai, or cohere). We'll assume for now that you have a function or mechanism to convert text into vectors.
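For illustration only, such a helper might look like the sketch below. The model name is an assumption, and the ChromaDB examples that follow use Chroma's built-in default embedding function instead of this helper:

# Hypothetical helper for turning text into vectors, sketched with sentence-transformers.
from sentence_transformers import SentenceTransformer

_embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

def embed_texts(texts):
    """Convert a list of strings into a list of embedding vectors."""
    return _embedding_model.encode(texts).tolist()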
You can start using ChromaDB either in-memory (data is lost when the script ends) or by configuring it to persist data to disk.
import chromadb

# In-memory client (easiest to start)
# client = chromadb.Client()

# Persistent client (saves data to disk in the 'my_chroma_db' directory)
client = chromadb.PersistentClient(path="./my_chroma_db")

# Create or get a collection (like a table in a SQL database).
# Collections require an embedding function or explicit embedding vectors.
# For simplicity here, we'll use Chroma's default SentenceTransformer-based
# embedding function. In practice, you'd often configure a specific one.
# Note: This might download a model on first use.
try:
    collection = client.get_collection(name="my_documents")
    print("Collection 'my_documents' already exists.")
except Exception:
    print("Creating collection 'my_documents'...")
    # If using Chroma's default embedding function, ensure sentence-transformers
    # and torch are installed (`pip install sentence-transformers torch`).
    # Or provide your own embedding function:
    # client.create_collection(name="my_documents", embedding_function=my_embedding_function)
    collection = client.create_collection(name="my_documents")
    print("Collection created.")
Here, PersistentClient tells ChromaDB to save its data in the specified directory. A collection is where you'll store your vectors, the original text (documents), and any associated metadata. Chroma needs to know how to embed text, either through an embedding_function or by having pre-computed vectors added directly. Chroma's default is often sufficient for initial experimentation.
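If you prefer an explicit embedding function over the default, one way to configure it is sketched below; the model name is only an example:

# Sketch: configuring an explicit embedding function for the collection.
from chromadb.utils import embedding_functions

st_embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # example model; any supported model works
)
# collection = client.create_collection(
#     name="my_documents",
#     embedding_function=st_embedder
# )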
Now, let's add some text documents to our collection. ChromaDB will automatically use its configured embedding function (the default in this case) to convert the text into vectors before storing them.
# Data to add
documents = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
    "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy.",
    "The Python programming language is widely used for web development, data science, and artificial intelligence."
]
metadatas = [
    {"source": "wiki-landmark", "topic": "architecture"},
    {"source": "wiki-landmark", "topic": "architecture"},
    {"source": "wiki-biology", "topic": "science"},
    {"source": "wiki-programming", "topic": "technology"}
]
ids = ["doc1", "doc2", "doc3", "doc4"]  # Unique IDs for each document

# Add documents to the collection.
# ChromaDB automatically handles embedding generation if vectors are not provided explicitly.
try:
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Added {len(ids)} documents to the collection.")
except ValueError as e:
    print(f"Error adding documents (perhaps duplicates?): {e}")
    # Handle the case where IDs already exist if running the script multiple times.
    # For simplicity, we just print the error here.
We provide the text content (documents), optional metadatas (useful for filtering later), and unique ids for each entry.
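If you re-run the script, the add call above will complain about the existing IDs. One alternative, sketched below, is the upsert method available in recent ChromaDB versions, which inserts new IDs and overwrites existing ones:

# Sketch: upsert inserts new IDs and overwrites existing ones, so re-running
# the script does not raise duplicate-ID errors.
collection.upsert(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)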
The core purpose of the vector store is retrieval. Let's ask a question and find the most relevant documents.
query_text = "What is a popular language for AI?"
n_results = 2  # Number of most similar documents to retrieve

# Perform the similarity search
results = collection.query(
    query_texts=[query_text],
    n_results=n_results,
    # Optional: include=['documents', 'distances', 'metadatas']  # Specify what to return
)

print(f"\nQuery: {query_text}")
print(f"Found {len(results.get('ids', [[]])[0])} results:")

# Process and display results
retrieved_ids = results.get('ids', [[]])[0]
retrieved_docs = results.get('documents', [[]])[0]
retrieved_metadatas = results.get('metadatas', [[]])[0]
retrieved_distances = results.get('distances', [[]])[0]

for i in range(len(retrieved_ids)):
    print(f"  ID: {retrieved_ids[i]}")
    print(f"  Distance: {retrieved_distances[i]:.4f}")  # Lower distance = more similar
    print(f"  Metadata: {retrieved_metadatas[i]}")
    print(f"  Document: {retrieved_docs[i]}")
    print("-" * 20)
ChromaDB takes the query_text, automatically embeds it using the same process as the stored documents, and performs a nearest-neighbor search in the embedding space (using L2 or cosine distance, depending on the collection's configuration) to find the n_results closest vectors. The results include the IDs, the original documents, metadata, and a distance score (lower typically means more similar). As expected, the query about AI languages should retrieve the document about Python.
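The metadata we stored earlier also lets you constrain the search. The sketch below combines similarity search with a where filter so that only documents tagged with a given topic are considered:

# Sketch: similarity search restricted to documents whose metadata matches a filter.
filtered = collection.query(
    query_texts=["famous structures"],
    n_results=2,
    where={"topic": "architecture"}  # only consider documents tagged with this topic
)
print(filtered.get('documents', [[]])[0])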
Diagram illustrating the flow of data during indexing and querying with a vector store like ChromaDB. Text is converted to embeddings and stored; queries are embedded and compared against stored vectors to find relevant information.
Libraries like LangChain and LlamaIndex often provide higher-level abstractions over vector stores. You typically configure the vector store once (as we did above) and then pass the collection object, or a specialized Retriever object built from it, to the framework.
For example, in LangChain, you might wrap a Chroma collection in a Chroma vector store object and use it as a retriever in a RAG chain:
# Example Snippet (Conceptual - assumes LangChain is installed)
# from langchain_community.vectorstores import Chroma
# from langchain_core.embeddings import Embeddings # Base class
# Assuming 'client' and 'collection_name' are defined as before
# and 'embedding_function' is a LangChain-compatible embedding object
# vector_store = Chroma(
# client=client,
# collection_name="my_documents",
# embedding_function=embedding_function # e.g., OpenAIEmbeddings()
# )
# retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# 'retriever' can now be used in a LangChain RAG chain
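Once wrapped, the retriever is used like any other LangChain retriever. A conceptual usage sketch, assuming a recent LangChain version where retrievers expose the standard invoke method:

# Example usage (Conceptual)
# docs = retriever.invoke("What is a popular language for AI?")
# for doc in docs:
#     print(doc.page_content)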
Similarly, LlamaIndex uses vector stores as part of its VectorStoreIndex to manage the storage and retrieval of node embeddings.
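A conceptual sketch of that pattern, assuming the llama-index package and its Chroma integration are installed (module paths vary between LlamaIndex versions):

# Example Snippet (Conceptual - assumes llama-index and its Chroma integration are installed)
# from llama_index.core import VectorStoreIndex
# from llama_index.vector_stores.chroma import ChromaVectorStore

# vector_store = ChromaVectorStore(chroma_collection=collection)
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
# query_engine = index.as_query_engine()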
While excellent for getting started, simple local stores like these trade scalability, concurrency, and operational features (such as replication and access control) for ease of use. For smaller projects or initial development, stores like ChromaDB offer a great balance of simplicity and capability. As your needs grow, you might explore more complex self-hosted options (e.g., Weaviate, Qdrant) or managed cloud services (e.g., Pinecone, Google Vertex AI Matching Engine, Azure AI Search).
With a basic vector store set up and populated, you now have the core retrieval mechanism ready. The next step is to integrate this retriever into a complete RAG pipeline, combining it with an LLM to generate informed responses based on the retrieved context.