Theory provides a solid foundation, but hands-on practice is essential for understanding how vector databases operate. In this section we interact directly with a vector database through a popular, easy-to-set-up library, performing the fundamental Create, Read, Update, and Delete (CRUD) operations to solidify our understanding of how vectors and their associated metadata are managed.

We will use chromadb, an open-source vector database designed for simplicity and ease of use, especially for getting started locally. While other powerful options like Qdrant, Weaviate, and Milvus exist (and we'll touch upon them later), ChromaDB allows us to quickly set up an environment and focus on the core interactions.

Setting Up Your Environment

First, ensure you have Python installed (version 3.8 or higher is recommended). You'll need to install the chromadb library and the sentence-transformers library for embedding text data. Open your terminal or command prompt and run:

```bash
pip install chromadb sentence-transformers
```

This command installs ChromaDB and the sentence-transformers library, which provides easy access to pre-trained models for creating text embeddings.

Connecting to the Database

ChromaDB can run in memory or persist data to disk. For this exercise, we'll use a persistent client that saves data to a local directory.

```python
import chromadb
import uuid  # To generate unique IDs

# Set up a persistent client. Data will be stored in the 'my_vector_db' directory.
client = chromadb.PersistentClient(path="./my_vector_db")

print("ChromaDB client initialized.")
# You can verify the storage path:
# print(f"Storage path: {client.settings.persist_directory}")
```

This code initializes a client instance that stores its data in a subdirectory named my_vector_db within your current working directory.
If the directory doesn't exist, ChromaDB will create it.

Creating a Collection

In ChromaDB (and many other vector databases), data is organized into "collections," which are analogous to tables in relational databases or indices in search engines. Each collection typically holds vectors of the same dimensionality and often uses a specific distance metric.

Let's create a collection named documents_collection. ChromaDB allows specifying an embedding function directly, which simplifies the process because the collection then handles text embedding automatically.

```python
from chromadb.utils import embedding_functions

# Use a pre-built Sentence Transformer model for embeddings
# This model works well for general-purpose semantic search
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create the collection, specifying the embedding function.
# To reuse an existing collection instead, use get_or_create_collection.
collection_name = "documents_collection"
try:
    client.delete_collection(name=collection_name)  # Delete if it exists from a previous run
    print(f"Collection '{collection_name}' deleted.")
except Exception:
    pass  # Collection doesn't exist yet; nothing to delete

collection = client.create_collection(
    name=collection_name,
    embedding_function=sentence_transformer_ef,
    metadata={"hnsw:space": "cosine"}  # Specify cosine distance (common for text)
)

print(f"Collection '{collection.name}' created successfully.")
```

Here, we first attempt to delete any existing collection with the same name to ensure a clean start. Then we create the collection, passing our chosen embedding function (all-MiniLM-L6-v2 via sentence-transformers) and specifying cosine distance as the metric through the metadata parameter (this configures the HNSW index ChromaDB uses by default).

Adding Data (Create/Upsert)

Now, let's add some data.
Each item in a ChromaDB collection needs:

- A unique id.
- The data itself (in this case, documents, which are text strings).
- Optional metadata (key-value pairs associated with the vector).

ChromaDB's add and upsert methods embed the documents automatically using the function we provided during collection creation. upsert is often preferred because it adds new items and updates existing ones when the ID already exists.

```python
# Sample data: documents with associated metadata
docs = [
    "The Catcher in the Rye is a classic novel by J.D. Salinger.",
    "Artificial intelligence is transforming many industries.",
    "Paris is the capital city of France, known for the Eiffel Tower.",
    "To Kill a Mockingbird examines themes of justice and prejudice.",
    "Machine learning algorithms learn patterns from data.",
    "The Louvre Museum in Paris houses famous works of art."
]

metadata = [
    {'genre': 'Fiction', 'year': 1951, 'topic': 'Literature'},
    {'genre': 'Non-fiction', 'year': 2023, 'topic': 'Technology'},
    {'genre': 'Non-fiction', 'year': 1889, 'topic': 'Geography'},  # Year Eiffel Tower completed
    {'genre': 'Fiction', 'year': 1960, 'topic': 'Literature'},
    {'genre': 'Non-fiction', 'year': 2022, 'topic': 'Technology'},
    {'genre': 'Non-fiction', 'year': 1793, 'topic': 'Art'}  # Year Louvre opened
]

# Generate unique IDs for each document
ids = [str(uuid.uuid4()) for _ in docs]

# Add the data to the collection
collection.add(
    documents=docs,
    metadatas=metadata,
    ids=ids
)

print(f"Added {collection.count()} items to the collection.")

# Verify one item was added (optional)
# print(collection.get(ids=[ids[0]]))
```

We've prepared three parallel lists of equal length: documents, corresponding metadata dictionaries, and unique IDs. The collection.add method takes these lists. ChromaDB processes the documents, generates embeddings using the all-MiniLM-L6-v2 model, and stores the embeddings along with the documents, metadata, and IDs.

Searching for Similar Data (Read)

The primary operation in a vector database is similarity search.
We provide a query (text in this case), ChromaDB embeds it using the same embedding function, and then finds the stored vectors closest to the query vector according to the chosen distance metric (cosine distance).

```python
import json

# Query the collection
query_texts = ["Tell me about famous books.", "What is AI?"]

results = collection.query(
    query_texts=query_texts,
    n_results=2,  # Ask for the top 2 most similar results for each query
    include=['documents', 'distances', 'metadatas']  # Specify what data to return
)

# Print the results nicely
print("\nSearch Results:")
print(json.dumps(results, indent=2))
```

The collection.query method takes our query_texts. We ask for the top n_results=2 matches for each query. The include parameter lets us specify which parts of the stored data we want back (the original documents, the distance scores, and the metadata). The results show the closest matches from our collection for each query, along with their distances (a lower cosine distance means higher similarity).

Filtering with Metadata (Read)

Vector databases often allow combining similarity search with metadata filtering. This is powerful for narrowing down results based on specific attributes. Let's find documents similar to "European landmarks" but only consider those with the topic 'Geography' or 'Art'.

```python
# Query with metadata filtering
filtered_results = collection.query(
    query_texts=["European landmarks"],
    n_results=2,
    where={"topic": {"$in": ["Geography", "Art"]}},  # Filter: topic must be 'Geography' OR 'Art'
    include=['documents', 'distances', 'metadatas']
)

print("\nFiltered Search Results (Topic: Geography or Art):")
print(json.dumps(filtered_results, indent=2))
```

The where clause uses a dictionary to define filter conditions. Here, $in specifies that the topic field in the metadata must be one of the values in the provided list.
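To make the combination of filtering and ranking concrete, here is a minimal pure-Python sketch of the idea: keep only the records whose metadata passes an $in-style condition, then order the survivors by cosine distance. It is a conceptual illustration under stated assumptions, not a description of how ChromaDB implements filtering internally. The helper names (cosine_distance, matches_in, filtered_search) and the tiny 2-D vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def matches_in(metadata, field, allowed):
    """Mimic a {"field": {"$in": [...]}} condition on one metadata dict."""
    return metadata.get(field) in allowed

def filtered_search(query_vec, records, field, allowed, n_results=2):
    """Drop records that fail the filter, then rank the rest by distance."""
    candidates = [r for r in records if matches_in(r["metadata"], field, allowed)]
    ranked = sorted(candidates, key=lambda r: cosine_distance(query_vec, r["vector"]))
    return ranked[:n_results]

# Hypothetical 2-D vectors standing in for real embeddings
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"topic": "Technology"}},
    {"id": "b", "vector": [0.8, 0.6], "metadata": {"topic": "Geography"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"topic": "Art"}},
]

hits = filtered_search([1.0, 0.0], records, "topic", ["Geography", "Art"])
print([h["id"] for h in hits])  # prints ['b', 'c']
```

Record "a" is the closest vector to the query, yet it never appears in the output because its topic is excluded by the filter.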
Notice how the filter can change the results compared to an unfiltered search.

Updating Data (Update)

As mentioned, many vector databases, including ChromaDB, support an upsert operation: calling upsert with an ID that already exists in the collection replaces the existing entry (vector, document, metadata) with the new data provided for that ID. add, by contrast, is intended for new IDs. To change only specific fields of an existing item, ChromaDB also provides a dedicated update method, which we use here.

```python
# Let's update the metadata for the first document
first_id = ids[0]
print(f"\nUpdating metadata for ID: {first_id}")

# Get the original document text for this ID
original_doc = collection.get(ids=[first_id], include=['documents'])['documents'][0]

collection.update(
    ids=[first_id],
    metadatas=[{'genre': 'Classic Fiction', 'year': 1951, 'topic': 'Literature', 'status': 'Updated'}],
    # You can also update documents or embeddings if needed
    # documents=[new_document_text]  # Example if you wanted to change the text
)

# Verify the update
updated_item = collection.get(ids=[first_id], include=['metadatas'])
print("Updated item metadata:")
print(json.dumps(updated_item['metadatas'][0], indent=2))
```

We use collection.update, targeting first_id specifically. We only provide the metadatas argument, so only the metadata for that item is overwritten.
The original vector and document remain associated with that ID unless explicitly updated.

Deleting Data (Delete)

Finally, you can remove items from the collection using their IDs.

```python
# Delete the second item we added
item_to_delete_id = ids[1]
print(f"\nDeleting item with ID: {item_to_delete_id}")

initial_count = collection.count()
collection.delete(ids=[item_to_delete_id])
final_count = collection.count()

print(f"Collection count before delete: {initial_count}")
print(f"Collection count after delete: {final_count}")

# Verify it's gone (attempting to get it should yield an empty result or error)
try:
    deleted_item_check = collection.get(ids=[item_to_delete_id])
    if not deleted_item_check['ids']:
        print(f"Item {item_to_delete_id} successfully deleted.")
    else:
        print(f"Item {item_to_delete_id} deletion failed.")  # Should not happen
except Exception as e:
    # Depending on the client version, it might raise an error or return empty
    print(f"Item {item_to_delete_id} not found (likely deleted). Error: {e}")
```

The collection.delete method removes the specified item(s) from the collection.
We verify the deletion by checking the collection count and attempting to retrieve the deleted item.

Summary of Operations

This hands-on exercise demonstrated the fundamental lifecycle of data within a vector database using ChromaDB:

```dot
digraph CRUD_Flow {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor="#e9ecef", fontname="Arial"];
    edge [fontname="Arial"];

    Start [label="Initialize Client", shape=ellipse, fillcolor="#a5d8ff"];
    CreateCollection [label="Create Collection\n(documents_collection)", fillcolor="#96f2d7"];
    AddData [label="Add/Upsert Data\n(Docs, Metadata, IDs)", fillcolor="#ffec99"];
    QueryData [label="Query (Similarity Search)", fillcolor="#bac8ff"];
    FilterQuery [label="Query with Metadata Filter", fillcolor="#d0bfff"];
    UpdateData [label="Update Item\n(Metadata Example)", fillcolor="#ffe066"];
    DeleteData [label="Delete Item", fillcolor="#ffc9c9"];
    End [label="Interaction Complete", shape=ellipse, fillcolor="#a5d8ff"];

    Start -> CreateCollection;
    CreateCollection -> AddData [label=" Store vectors"];
    AddData -> QueryData [label=" Search"];
    QueryData -> FilterQuery [label=" Refine search"];
    FilterQuery -> UpdateData [label=" Modify"];
    UpdateData -> DeleteData [label=" Remove"];
    DeleteData -> End;
    AddData -> UpdateData [label=" Upsert existing ID"]; // Alternative path for update
}
```

Basic workflow for interacting with a vector database collection.

We connected to the database, created a structured container (collection), added vectorized data along with descriptive metadata, performed similarity searches (both general and filtered), updated an entry, and finally removed an item. These core operations form the building blocks for implementing semantic search and other applications powered by vector similarity. Feel free to experiment further by adding more data, trying different queries, and exploring various metadata filter combinations.
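When experimenting with more queries, the nested dictionary returned by collection.query can be awkward to read, because each field holds one inner list per query text. The sketch below shows one possible way to flatten it into one row per hit. The flatten_results helper is hypothetical, it assumes the query was run with documents, distances, and metadatas included, and sample_results is a hand-written stand-in with illustrative values rather than real output.

```python
def flatten_results(query_texts, results):
    """Turn nested per-query result lists into one flat row per hit."""
    rows = []
    for q_idx, query in enumerate(query_texts):
        for rank, item_id in enumerate(results["ids"][q_idx]):
            rows.append({
                "query": query,
                "rank": rank + 1,  # 1-based position within this query's hits
                "id": item_id,
                "document": results["documents"][q_idx][rank],
                "distance": results["distances"][q_idx][rank],
                "metadata": results["metadatas"][q_idx][rank],
            })
    return rows

# Hand-written stand-in for a query result (illustrative values, not real output)
sample_results = {
    "ids": [["id-1", "id-2"]],
    "documents": [["First doc", "Second doc"]],
    "distances": [[0.25, 0.5]],
    "metadatas": [[{"topic": "Art"}, {"topic": "Geography"}]],
}

for row in flatten_results(["European landmarks"], sample_results):
    print(f'{row["rank"]}. {row["document"]} '
          f'(distance={row["distance"]}, topic={row["metadata"]["topic"]})')
```

With a real collection, the same helper could be called as flatten_results(query_texts, results) using the variables from the search step.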