Theory provides a solid foundation, but hands-on practice is essential for understanding how vector databases operate. In this practical exercise, we'll move beyond concepts and interact directly with a vector database. We'll use a popular, easy-to-set-up library to perform the fundamental Create, Read, Update, and Delete (CRUD) operations discussed earlier. This will solidify your understanding of how vectors and their associated metadata are managed.
We will use chromadb, an open-source vector database designed for simplicity and ease of use, especially for getting started locally. While other powerful options like Qdrant, Weaviate, and Milvus exist (and we'll touch upon them later), ChromaDB allows us to quickly set up an environment and focus on the core interactions.
First, ensure you have Python installed (version 3.8 or higher is recommended). You'll need to install the chromadb library and a sentence transformer model for embedding text data. Open your terminal or command prompt and run:
pip install chromadb sentence-transformers
This command installs ChromaDB and the sentence-transformers library, which provides easy access to pre-trained models for creating text embeddings.
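If you want to confirm the installation succeeded before continuing, a quick sanity check like the following should work (both packages expose a __version__ attribute in recent releases):

import chromadb
import sentence_transformers

# Print the installed versions to confirm both packages import cleanly.
print(chromadb.__version__)
print(sentence_transformers.__version__)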
ChromaDB can run in memory or persist data to disk. For this exercise, we'll use a persistent client that saves data to a local directory.
import chromadb
import uuid # To generate unique IDs
# Set up a persistent client. Data will be stored in the 'my_vector_db' directory.
client = chromadb.PersistentClient(path="./my_vector_db")
print("ChromaDB client initialized.")
# You can verify the storage path via the client's settings:
# print(f"Storage path: {client.get_settings().persist_directory}")
This code initializes a client instance that will store its data in a subdirectory named my_vector_db within your current working directory. If the directory doesn't exist, ChromaDB will create it.
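If you don't need the data to survive between runs, ChromaDB can also run purely in memory; everything is discarded when the process exits. A minimal sketch (recent ChromaDB releases expose this as EphemeralClient; older versions use chromadb.Client()):

import chromadb

# An ephemeral client keeps all data in memory for the lifetime of the process.
ephemeral_client = chromadb.EphemeralClient()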
In ChromaDB (and many other vector databases), data is organized into "collections," which are analogous to tables in relational databases or indices in search engines. Each collection typically holds vectors of the same dimensionality and often uses a specific distance metric.
Let's create a collection named documents_collection. ChromaDB allows specifying an embedding function directly, simplifying the process as it will handle text embedding automatically.
from chromadb.utils import embedding_functions
# Use a pre-built Sentence Transformer model for embeddings
# This model works well for general-purpose semantic search
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
# Create the collection, specifying the embedding function.
# If the collection already exists, use get_or_create_collection.
collection_name = "documents_collection"
try:
    client.delete_collection(name=collection_name)  # Delete if it exists from a previous run
    print(f"Collection '{collection_name}' deleted.")
except Exception:
    pass  # Collection doesn't exist yet; nothing to delete
collection = client.create_collection(
    name=collection_name,
    embedding_function=sentence_transformer_ef,
    metadata={"hnsw:space": "cosine"}  # Specify cosine distance (common for text)
)
print(f"Collection '{collection.name}' created successfully.")
Here, we first attempt to delete any existing collection with the same name to ensure a clean start. Then we create the collection, passing our chosen embedding function (all-MiniLM-L6-v2 via sentence-transformers) and specifying cosine distance as the metric through the metadata parameter (this setting applies to the HNSW index ChromaDB uses by default).
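Cosine isn't the only option: ChromaDB's HNSW index also accepts "l2" (squared Euclidean, the default) and "ip" (inner product) for hnsw:space. A sketch of a collection configured for Euclidean distance instead (the collection name here is purely illustrative):

l2_collection = client.create_collection(
    name="l2_example_collection",
    embedding_function=sentence_transformer_ef,
    metadata={"hnsw:space": "l2"}  # Squared Euclidean distance instead of cosine
)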
Now, let's add some data. Each item in a ChromaDB collection needs:
- A unique id.
- The content to embed (documents, which are text strings).
- Optionally, metadata (key-value pairs associated with the vector).
ChromaDB's add or upsert methods handle embedding the documents automatically using the function we provided during collection creation. upsert is often preferred, as it adds new items or updates existing ones if the ID already exists.
# Sample data: documents with associated metadata
docs = [
    "The Catcher in the Rye is a classic novel by J.D. Salinger.",
    "Artificial intelligence is transforming many industries.",
    "Paris is the capital city of France, known for the Eiffel Tower.",
    "To Kill a Mockingbird explores themes of justice and prejudice.",
    "Machine learning algorithms learn patterns from data.",
    "The Louvre Museum in Paris houses famous works of art."
]
metadata = [
    {'genre': 'Fiction', 'year': 1951, 'topic': 'Literature'},
    {'genre': 'Non-fiction', 'year': 2023, 'topic': 'Technology'},
    {'genre': 'Non-fiction', 'year': 1889, 'topic': 'Geography'},  # Year the Eiffel Tower was completed
    {'genre': 'Fiction', 'year': 1960, 'topic': 'Literature'},
    {'genre': 'Non-fiction', 'year': 2022, 'topic': 'Technology'},
    {'genre': 'Non-fiction', 'year': 1793, 'topic': 'Art'}  # Year the Louvre opened
]
# Generate unique IDs for each document
ids = [str(uuid.uuid4()) for _ in docs]
# Add the data to the collection
collection.add(
    documents=docs,
    metadatas=metadata,
    ids=ids
)
print(f"Added {collection.count()} items to the collection.")
# Verify one item was added (optional)
# print(collection.get(ids=[ids[0]]))
We've prepared lists of documents, corresponding metadata dictionaries, and unique IDs. The collection.add method takes these lists; ChromaDB processes the documents, generates embeddings using the all-MiniLM-L6-v2 model, and stores the embeddings along with the documents, metadata, and IDs.
The primary operation in a vector database is similarity search. We provide a query (text in this case), ChromaDB embeds it using the same embedding function, and then finds the stored vectors closest to the query vector according to the chosen distance metric (here, cosine distance).
# Query the collection
query_texts = ["Tell me about famous books.", "What is AI?"]
results = collection.query(
    query_texts=query_texts,
    n_results=2,  # Ask for the top 2 most similar results for each query
    include=['documents', 'distances', 'metadatas']  # Specify what data to return
)
# Print the results nicely
import json
print("\nSearch Results:")
print(json.dumps(results, indent=2))
The collection.query method takes our query_texts. We ask for the top n_results=2 matches for each query, and the include parameter lets us specify which parts of the stored data we want back (the original documents, the distance scores, and the metadata). The results show the closest matches from our collection for each query, along with their distances (lower cosine distance means higher similarity).
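To make the distance numbers concrete, you can reproduce them outside ChromaDB. A sketch, relying on the fact that ChromaDB's cosine space reports distance as 1 minus cosine similarity:

from sentence_transformers import SentenceTransformer
import numpy as np

# Embed a query and one stored document with the same model the collection uses.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec, doc_vec = model.encode(["Tell me about famous books.", docs[0]])

# Cosine similarity, then the distance ChromaDB would report for this pair.
cos_sim = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(f"Cosine distance: {1 - cos_sim:.4f}")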
Vector databases often allow combining similarity search with metadata filtering. This is powerful for narrowing down results based on specific attributes. Let's find documents similar to "European landmarks" but only consider those with the topic 'Geography' or 'Art'.
# Query with metadata filtering
filtered_results = collection.query(
    query_texts=["European landmarks"],
    n_results=2,
    where={"topic": {"$in": ["Geography", "Art"]}},  # Filter: topic must be 'Geography' OR 'Art'
    include=['documents', 'distances', 'metadatas']
)
print("\nFiltered Search Results (Topic: Geography or Art):")
print(json.dumps(filtered_results, indent=2))
The where clause uses a dictionary to define filter conditions. Here, $in specifies that the topic field in the metadata must be one of the values in the provided list. Notice how this potentially changes the results compared to an unfiltered search.
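The filter language supports other operators as well, including $eq, $ne, $gt, $gte, $lt, $lte, and the logical combinators $and and $or. For example, a sketch restricting results to recent Technology documents (the query text is illustrative):

recent_tech = collection.query(
    query_texts=["modern computing"],
    n_results=2,
    where={"$and": [{"topic": {"$eq": "Technology"}}, {"year": {"$gte": 2022}}]}
)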
As mentioned, many vector databases, including ChromaDB, handle updates using an upsert operation. If you call upsert with an ID that already exists in the collection, ChromaDB replaces the existing entry (vector, document, metadata) with the new data provided for that ID. Calling add with an existing ID, by contrast, does not overwrite it; add treats already-present IDs as duplicates rather than updates.
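A minimal upsert sketch, reusing the first item's ID from our earlier add (illustrative only; running it would overwrite that entry):

collection.upsert(
    ids=[ids[0]],  # An ID that already exists, so this replaces rather than inserts
    documents=["The Catcher in the Rye is J.D. Salinger's 1951 novel."],
    metadatas=[{'genre': 'Fiction', 'year': 1951, 'topic': 'Literature'}]
)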
# Let's update the metadata for the first document
first_id = ids[0]
print(f"\nUpdating metadata for ID: {first_id}")
# Get the original document text for this ID (fetched for reference only; not needed for the update)
original_doc = collection.get(ids=[first_id], include=['documents'])['documents'][0]
collection.update(
    ids=[first_id],
    metadatas=[{'genre': 'Classic Fiction', 'year': 1951, 'topic': 'Literature', 'status': 'Updated'}]
    # You could also pass documents=[...] here to replace the stored text
)
# Verify the update
updated_item = collection.get(ids=[first_id], include=['metadatas'])
print("Updated item metadata:")
print(json.dumps(updated_item['metadatas'][0], indent=2))
We use collection.update, specifically targeting the first_id. We only provide the metadatas argument, so only the metadata for that item is overwritten; the original vector and document remain associated with that ID unless explicitly updated.
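If you do pass documents to update, ChromaDB re-embeds the new text with the collection's embedding function, so the stored vector changes along with the text. A sketch (the replacement text is purely illustrative):

collection.update(
    ids=[first_id],
    documents=["The Catcher in the Rye, published in 1951, is J.D. Salinger's best-known novel."]
)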
Finally, you can remove items from the collection using their IDs.
# Delete the second item we added
item_to_delete_id = ids[1]
print(f"\nDeleting item with ID: {item_to_delete_id}")
initial_count = collection.count()
collection.delete(ids=[item_to_delete_id])
final_count = collection.count()
print(f"Collection count before delete: {initial_count}")
print(f"Collection count after delete: {final_count}")
# Verify it's gone (attempting to get it should yield an empty result or error)
try:
    deleted_item_check = collection.get(ids=[item_to_delete_id])
    if not deleted_item_check['ids']:
        print(f"Item {item_to_delete_id} successfully deleted.")
    else:
        print(f"Item {item_to_delete_id} deletion failed.")  # Should not happen
except Exception as e:
    # Depending on the client version, get() may raise an error or return an empty result
    print(f"Item {item_to_delete_id} not found (likely deleted). Error: {e}")
The collection.delete method removes the specified item(s) from the collection. We verify the deletion by checking the collection count and attempting to retrieve the deleted item.
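Deletion isn't limited to IDs: delete also accepts a where filter, removing every item whose metadata matches. A sketch (running this would remove all remaining Technology documents, so treat it as illustrative):

# Delete every item whose metadata topic is 'Technology'
collection.delete(where={"topic": "Technology"})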
This hands-on exercise demonstrated the fundamental lifecycle of data within a vector database, following the basic workflow for interacting with a ChromaDB collection.
We connected to the database, created a structured container (collection), added vectorized data along with descriptive metadata, performed similarity searches (both general and filtered), updated an entry, and finally removed an item. These core operations form the building blocks for implementing semantic search and other applications powered by vector similarity. Feel free to experiment further by adding more data, trying different queries, and exploring various metadata filter combinations.