Just as with traditional databases that store structured tables or documents, vector databases require mechanisms to manage the lifecycle of the data they hold. While their defining characteristic is the ability to perform fast similarity searches on high-dimensional vectors, the fundamental operations of creating, reading, updating, and deleting entries remain essential building blocks. These operations, often grouped under the acronym CRUD, possess distinct characteristics when applied to vector data.
The primary way to populate a vector database is by adding new data points. This operation typically involves providing several pieces of information for each entry: a unique ID, the vector embedding itself, and optionally a metadata payload of associated attributes.
Many vector databases implement this operation as an upsert (a portmanteau of update and insert). An upsert checks whether an entry with the given ID already exists: if it does, the existing entry is updated with the new vector and/or metadata; if not, a new entry is created. This is convenient because a single command handles both initial insertion and subsequent updates, simplifying application logic.
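The check-then-write logic behind an upsert can be sketched in plain Python. This is purely illustrative (a dict standing in for the database's storage layer), not a real client API:

```python
# Minimal sketch of upsert semantics, using a plain dict as the "store".
store = {}

def upsert(store, point):
    """Insert the point, or overwrite an existing entry with the same ID."""
    existing = store.get(point["id"])
    if existing is not None:
        # Update: replace the vector and/or payload of the existing entry
        existing.update(point)
    else:
        # Insert: create a new entry keyed by its ID
        store[point["id"]] = dict(point)

upsert(store, {"id": "doc_1", "vector": [0.1, 0.2], "payload": {"page": 1}})
upsert(store, {"id": "doc_1", "vector": [0.1, 0.2], "payload": {"page": 2}})

# The second call updated the existing entry rather than creating a duplicate
assert len(store) == 1
assert store["doc_1"]["payload"]["page"] == 2
```

A real database performs the same ID check, but additionally updates its index structures so the new vector is searchable.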
Consider an example using a hypothetical Python client:
# Adding document chunks to a collection
points_to_add = [
    {
        "id": "doc_abc_01",
        "vector": [0.05, 0.91, ..., -0.32],
        "payload": {"source_doc": "report.pdf", "page": 1, "type": "paragraph"}
    },
    {
        "id": "img_xyz_78",
        "vector": [0.67, -0.11, ..., 0.88],
        "payload": {"filename": "logo.png", "category": "branding", "project": "Project X"}
    }
    # Potentially many more points
]

vector_db.upsert(collection_name="knowledge_base", points=points_to_add)
Upon insertion or upsertion, the vector database integrates the new vector into its internal index structures (which we'll discuss in detail in the next chapter on ANN). This step is what makes the vector searchable based on similarity.
While the most frequent read operation is similarity search (finding neighbors of a query vector), there's also the fundamental need to fetch a specific, known data point. This is typically done using the unique ID assigned during creation. It's analogous to SELECT * FROM table WHERE id = ? in SQL.
Retrieving by ID is necessary for various tasks, such as verifying that data was stored correctly, fetching the payload associated with a known result, or debugging:
# Fetching specific points by their IDs
retrieved_items = vector_db.retrieve(
    collection_name="knowledge_base",
    ids=["doc_abc_01", "img_xyz_78"]
)

# retrieved_items would typically contain a list of objects,
# each including the ID, vector (optional), and payload.
for item in retrieved_items:
    print(f"ID: {item.id}, Payload: {item.payload}")
This operation is a direct lookup and does not involve vector similarity calculations.
Data isn't always static. You might need to modify existing entries in your vector database. Updates generally fall into two categories: updating the vector itself, and updating the associated metadata (payload).
Updating the vector itself usually requires the database to remove the old vector's position in the index and insert the new one, effectively re-indexing that specific point. As mentioned earlier, the upsert operation is often the preferred way to handle updates: by providing an existing ID with new vector or payload data, you overwrite the previous state.
# Updating metadata for an existing point using upsert
point_to_update = {
    "id": "doc_abc_01",
    "vector": [0.05, 0.91, ..., -0.32],  # Vector might be the same or updated
    "payload": {"source_doc": "report_v2.pdf", "page": 1, "type": "paragraph", "status": "revised"}  # Updated payload
}

vector_db.upsert(collection_name="knowledge_base", points=[point_to_update])
Removing data points is straightforward: you provide the unique ID(s) of the entries you wish to delete.
# Deleting points by ID
vector_db.delete(
    collection_name="knowledge_base",
    ids=["img_xyz_78"]
)
When an entry is deleted, the database must ensure it's removed from storage and, critically, from the search index. If not removed from the index, it could still erroneously appear in similarity search results. The exact mechanism (e.g., immediate removal, marking for later garbage collection) varies between databases and can impact performance during and after deletion.
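One common approach to deferred removal is "soft deletion" with tombstones: the entry is flagged as deleted immediately, search filters out flagged entries, and a later garbage-collection pass physically removes them. The sketch below illustrates the idea with a plain dict; it is not how any particular database implements it:

```python
# Illustrative tombstone-based deletion (not a real client API).
index = {
    "doc_abc_01": {"vector": [0.05, 0.91], "deleted": False},
    "img_xyz_78": {"vector": [0.67, -0.11], "deleted": False},
}

def soft_delete(index, point_id):
    # Fast: flip a flag instead of restructuring the index
    index[point_id]["deleted"] = True

def search_candidates(index):
    # Query time must filter tombstoned entries so they
    # never appear in similarity search results
    return [pid for pid, entry in index.items() if not entry["deleted"]]

def garbage_collect(index):
    # Periodic pass that physically removes tombstoned entries
    for pid in [p for p, e in index.items() if e["deleted"]]:
        del index[pid]

soft_delete(index, "img_xyz_78")
assert search_candidates(index) == ["doc_abc_01"]

garbage_collect(index)
assert "img_xyz_78" not in index
```

The trade-off: deletes return quickly, but tombstones consume space and add filtering work at query time until garbage collection runs.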
Performing CRUD operations one by one can be inefficient, especially when dealing with large numbers of vectors, since sending an individual network request for each operation incurs significant overhead. To mitigate this, most vector database clients support batch operations: you can group hundreds or thousands of upsert, retrieve, or delete operations into a single API call. This drastically reduces network latency and often allows the database backend to process the operations more efficiently. Always prefer batch operations when working with more than a handful of entries at a time.
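Even with batch support, very large datasets are usually sent in fixed-size chunks rather than one enormous request. A minimal sketch of client-side batching, assuming a client with an upsert method that accepts a list of points per call (as in the examples above):

```python
def upsert_in_batches(vector_db, collection_name, points, batch_size=500):
    """Send points in fixed-size chunks to amortize per-request overhead."""
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        vector_db.upsert(collection_name=collection_name, points=batch)

# Stub client used here only to demonstrate the chunking behavior
class _StubDB:
    def __init__(self):
        self.calls = []

    def upsert(self, collection_name, points):
        self.calls.append(len(points))

db = _StubDB()
points = [{"id": str(i), "vector": [0.0, 0.0]} for i in range(1200)]
upsert_in_batches(db, "knowledge_base", points, batch_size=500)

# 1200 points are sent as three requests: 500, 500, and 200 points
assert db.calls == [500, 500, 200]
```

A reasonable batch size depends on vector dimensionality and the database's request-size limits; consult your client's documentation for recommended values.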
Mastering these CRUD operations is fundamental to managing the data within your vector database. They provide the necessary tools to populate, inspect, modify, and clean up your vector collections, setting the stage for the core capability these databases are built for: performing efficient and meaningful similarity searches at scale.
© 2025 ApX Machine Learning