ChromaDB offers a developer-friendly approach to vector storage and search, often favored for its simplicity, local-first capabilities, and integration with popular Python data science tools. It can operate entirely in memory for quick experiments, persist data to disk for durability between sessions, or run as a client/server application. This flexibility makes it a good choice for development, prototyping, and smaller-scale deployments.
First, ensure you have the chromadb library installed. If not, you can install it using pip:
pip install chromadb
The primary way to interact with ChromaDB is through its client object. The type of client you instantiate depends on how you intend to use ChromaDB:
In-Memory Client: Ideal for temporary use cases, testing, or when data persistence isn't required. Data is lost when the process ends.
import chromadb
# Creates an in-memory client
client = chromadb.Client()
print("Client type: In-memory")
Persistent Client: Stores data on disk in the specified directory, allowing data to survive across sessions.
import chromadb
# Creates a persistent client storing data in './chroma_db'
client = chromadb.PersistentClient(path="./chroma_db")
print("Client type: Persistent")
HTTP Client: Connects to a running ChromaDB server instance, typically used in production or shared environments. You'll need the host and port of the server.
import chromadb
# Example: Connects to a server running on localhost:8000
# Replace with your actual server host and port if different
try:
    client = chromadb.HttpClient(host='localhost', port=8000)
    print("Client type: HTTP Client (Connected)")
except Exception as e:
    print(f"Could not connect to ChromaDB server: {e}")
    print("Ensure a ChromaDB server is running at the specified host and port.")
    # Handle connection error appropriately
    client = None  # Set client to None or exit
Figure: Different ways a ChromaDB client can connect to storage or a server instance.
In ChromaDB, data is organized into collections. Think of a collection as analogous to a table in a relational database or an index in other vector databases. Each collection typically stores vectors of the same dimensionality along with their associated metadata.
You can create a new collection or get a handle to an existing one using create_collection() or get_or_create_collection(). The latter is often more convenient as it won't raise an error if the collection already exists.
# Using the 'client' variable initialized earlier (assuming it's not None)
if client:
    # Create a new collection named 'document_embeddings'
    # This will raise an error if it already exists
    try:
        collection = client.create_collection("document_embeddings")
        print("Collection 'document_embeddings' created.")
    except Exception as e:
        print(f"Could not create collection: {e}")

    # Get or create a collection (more common)
    # If 'search_index' exists, it gets it. Otherwise, it creates it.
    collection = client.get_or_create_collection("search_index")
    print(f"Got or created collection: {collection.name}")

    # You can specify an embedding function during creation
    # By default, ChromaDB uses Sentence Transformers (all-MiniLM-L6-v2)
    # More on embedding functions later if needed
    # collection = client.get_or_create_collection(
    #     name="my_custom_embeddings",
    #     embedding_function=custom_embedding_function  # Replace with your function
    # )
else:
    print("ChromaDB client not initialized.")
Once you have a collection object, you can add data using the add() method. You need to provide unique IDs for each item, and typically, you'll provide the original documents (text) and associated metadata.
If you don't specify embeddings directly, ChromaDB's configured embedding function (the default Sentence Transformer, or one you specified) will automatically generate them from the documents.
if client and collection:  # Ensure client and collection are valid
    try:
        collection.add(
            documents=[
                "This is the first document about apples.",
                "The second document discusses oranges and citrus fruits.",
                "A final document mentioning bananas and tropical fruits."
            ],
            metadatas=[
                {"source": "doc_1", "topic": "fruit"},
                {"source": "doc_2", "topic": "fruit"},
                {"source": "doc_3", "topic": "fruit"}
            ],
            ids=["id1", "id2", "id3"]  # Unique IDs for each document
        )
        print("Successfully added 3 documents to the collection.")

        # You can also add data with pre-computed embeddings
        # Ensure the dimensionality matches the collection's expectation
        # collection.add(
        #     embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
        #     metadatas=[{"source": "precomputed_1"}, {"source": "precomputed_2"}],
        #     ids=["pc_id1", "pc_id2"]
        # )
    except Exception as e:
        print(f"Error adding documents: {e}")
        # Handle potential issues like duplicate IDs
else:
    print("Client or collection not available for adding data.")
Key parameters for add():
documents: A list of strings (the text content). Required if embeddings are not provided and an embedding function is configured.
embeddings: A list of lists/arrays (the vector embeddings). Required if documents are not provided or if you want to use pre-computed vectors.
metadatas: A list of dictionaries, where each dictionary contains metadata key-value pairs for the corresponding document/embedding.
ids: A list of unique strings identifying each entry. These are mandatory and must be unique within the collection.
The core function of a vector database is similarity search. In ChromaDB, you use the query() method on a collection object. You can query using text (which will be automatically embedded) or by providing query vectors directly.
if client and collection:  # Ensure client and collection are valid
    try:
        # Query using text - ChromaDB embeds the query text automatically
        results = collection.query(
            query_texts=["Tell me about tropical fruits"],
            n_results=2  # Ask for the top 2 most similar results
        )
        print("\nQuery Results (Text Query):")
        print(results)

        # Example: Querying with a pre-computed embedding vector
        # Assuming 'query_vector' is a list or numpy array of the correct dimension
        # query_vector = [0.15, 0.25, ...]  # Replace with an actual vector
        # results_embedding = collection.query(
        #     query_embeddings=[query_vector],
        #     n_results=1
        # )
        # print("\nQuery Results (Embedding Query):")
        # print(results_embedding)
    except Exception as e:
        print(f"\nError querying collection: {e}")
else:
    print("\nClient or collection not available for querying.")
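The query() method returns a dictionary with the keys ids, distances, metadatas, documents, and embeddings. Each value holds one inner list per query, ordered from the nearest neighbor to the farthest. By default, embeddings is None; to get the stored vectors back, request them through the include argument (for example, include=["embeddings", "documents"]).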
Often, you need to combine semantic similarity search with filtering based on metadata attributes. ChromaDB allows this using the where argument in the query() method. It accepts a dictionary specifying the filtering conditions.
if client and collection:  # Ensure client and collection are valid
    try:
        # Find documents similar to "citrus" but only from 'doc_2'
        filtered_results = collection.query(
            query_texts=["citrus"],
            n_results=1,
            where={"source": "doc_2"}  # Filter for metadata field 'source' equal to 'doc_2'
        )
        print("\nQuery Results (with 'where' filter):")
        print(filtered_results)

        # Example using a document filter (less common for simple cases, more for complex logic)
        # filtered_results_doc = collection.query(
        #     query_texts=["apples"],
        #     n_results=1,
        #     where_document={"$contains": "first"}  # Filter documents containing the word "first"
        # )
        # print("\nQuery Results (with 'where_document' filter):")
        # print(filtered_results_doc)
    except Exception as e:
        print(f"\nError querying collection with filter: {e}")
else:
    print("\nClient or collection not available for filtered querying.")
ChromaDB supports various operators within the where filter, such as $eq (equal, default), $ne (not equal), $gt (greater than), $lt (less than), $gte (greater than or equal), $lte (less than or equal), $in (value is in a list), and $nin (value is not in a list). Refer to the ChromaDB documentation for the full set of filtering capabilities.
Collection objects also provide methods for managing data:
get(): Retrieve items by their IDs, optionally applying filters.
update(): Modify existing items (documents, embeddings, metadata).
upsert(): Add items if their IDs don't exist, or update them if they do.
delete(): Remove items by ID or by applying filters.
peek(): Retrieve a few items from the collection (useful for inspection).
count(): Get the total number of items in the collection.
if client and collection:  # Ensure client and collection are valid
    try:
        print(f"\nTotal items in collection: {collection.count()}")

        # Get an item by ID
        item = collection.get(ids=["id1"])
        print(f"\nRetrieved item 'id1': {item}")

        # Update metadata for an item
        collection.update(
            ids=["id1"],
            metadatas=[{"source": "doc_1_updated", "topic": "fruit", "reviewed": True}]
        )
        item_updated = collection.get(ids=["id1"])
        print(f"\nUpdated item 'id1': {item_updated}")

        # Delete an item
        collection.delete(ids=["id3"])
        print(f"\nItems after deleting 'id3': {collection.count()}")
    except Exception as e:
        print(f"\nError during other operations: {e}")
else:
    print("\nClient or collection not available for other operations.")
ChromaDB provides a straightforward and Pythonic interface for building vector search applications. Its focus on ease of use and flexible deployment options makes it an excellent tool for getting started and for applications where a managed, large-scale service might be overkill. Remember to consult the official ChromaDB documentation for the most up-to-date API details and advanced features.
© 2025 ApX Machine Learning