While traditional databases excel at storing and retrieving structured data based on exact matches or range queries on predefined fields, vector databases are built around a different core principle: managing and searching high-dimensional vectors based on similarity. This necessitates distinct data models and schema concepts optimized for this task.
At the heart of a vector database's data model lies the concept of a vector record (sometimes called an item, point, or document). This is the fundamental unit of data storage. Each vector record typically bundles together several key pieces of information:
A Unique Identifier (ID): Just like in most databases, each vector record needs a unique ID for direct retrieval, updates, or deletion. This is often a string or an integer provided by the user or generated by the database.
The Vector Embedding: This is the core numerical representation of the data point, usually stored as an array or list of floating-point numbers. For example, a 768-dimensional text embedding would be stored as an array containing 768 numbers. A critical aspect here is dimensionality consistency. Within a specific collection or index in the database, all vectors typically must have the same dimension (d). Trying to insert vectors of varying dimensions into the same index is usually not supported or requires special handling, as the underlying ANN algorithms rely on operating within a consistent vector space.
Metadata Payload: This is arguably what makes vector databases truly powerful for real-world applications. Vectors rarely exist in isolation; they represent something concrete, like a document, an image, a product, or a user profile. The metadata payload stores attributes associated with the vector, providing context and enabling more refined searches. This payload is often structured like a JSON object or a dictionary, containing key-value pairs.
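The three components above can be pictured as a small in-memory data structure. The following is a minimal sketch, not any particular database's API: `VectorRecord` and `Collection` are hypothetical names, and the length check in `upsert` mirrors the dimensionality-consistency requirement described above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VectorRecord:
    """One unit of storage: ID, embedding, and metadata payload."""
    id: str
    vector: list[float]
    metadata: dict[str, Any] = field(default_factory=dict)


class Collection:
    """Hypothetical in-memory collection that enforces a fixed dimension d."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self._records: dict[str, VectorRecord] = {}

    def upsert(self, record: VectorRecord) -> None:
        # Reject vectors whose length differs from the collection's dimension.
        if len(record.vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(record.vector)}"
            )
        self._records[record.id] = record

    def get(self, record_id: str) -> VectorRecord:
        # Direct retrieval by unique ID.
        return self._records[record_id]

    def delete(self, record_id: str) -> None:
        # Deleting a missing ID is treated as a no-op in this sketch.
        self._records.pop(record_id, None)

    def __len__(self) -> int:
        return len(self._records)


collection = Collection(dimension=4)  # tiny dimension, for illustration only
collection.upsert(VectorRecord("doc1_chunk3", [0.1, 0.2, 0.3, 0.4], {"topic": "ML"}))
```

Keying the records by ID keeps retrieval, update, and deletion simple. A real system would additionally maintain an index over the vectors for similarity search.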
The metadata associated with a vector can include a wide range of data types, similar to those found in NoSQL or relational databases, including flags such as is_active, on_sale, or is_processed.
Consider an example of a vector record representing a chunk of text from a document:
{
  "id": "doc1_chunk3",
  "vector": [0.013, -0.245, ..., 0.912], // A 512-dimensional vector
  "metadata": {
    "document_source": "/path/to/research_paper.pdf",
    "page_number": 15,
    "chunk_length": 350, // Number of characters or tokens
    "topic": "Machine Learning",
    "keywords": ["vector database", "ANN", "similarity search"],
    "published_year": 2023
  }
}
In this example, id uniquely identifies the record, vector holds the embedding, and metadata contains contextual information about the text chunk.
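To make the variety of metadata types concrete, the example record can be written as a Python dictionary and inspected. This is only an illustrative sketch: the vector holds just the three components shown in the example (the elided ones are not reconstructed), and `describe_metadata` is a hypothetical helper.

```python
record = {
    "id": "doc1_chunk3",
    # Only the three components shown in the example; the elided ones are omitted.
    "vector": [0.013, -0.245, 0.912],
    "metadata": {
        "document_source": "/path/to/research_paper.pdf",
        "page_number": 15,
        "chunk_length": 350,
        "topic": "Machine Learning",
        "keywords": ["vector database", "ANN", "similarity search"],
        "published_year": 2023,
    },
}


def describe_metadata(metadata: dict) -> dict:
    """Map each metadata field to the name of its Python type."""
    return {key: type(value).__name__ for key, value in metadata.items()}


field_types = describe_metadata(record["metadata"])
for name, type_name in field_types.items():
    print(f"{name}: {type_name}")
```

The payload here mixes strings, integers, and a list of strings, which is the kind of variety a metadata field can carry.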
How strictly this structure (vector dimension, metadata fields, data types) must be defined beforehand varies between vector database systems: some expect a schema to be declared up front, while others accept metadata fields without prior declaration.
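The range from strict to flexible schema handling can be pictured as a validator that either checks metadata against declared fields or lets undeclared fields through. The sketch below is hypothetical and not modeled on a specific product: `SCHEMA`, `validate`, and the `strict` flag are illustrative names.

```python
# Hypothetical declared schema: field name -> expected Python type.
SCHEMA = {"page_number": int, "topic": str, "published_year": int}


def validate(metadata: dict, schema: dict, strict: bool) -> list[str]:
    """Return a list of problems; an empty list means the metadata is accepted."""
    problems = []
    for key, value in metadata.items():
        if key not in schema:
            # Strict mode rejects undeclared fields; flexible mode lets them through.
            if strict:
                problems.append(f"unknown field: {key}")
            continue
        if not isinstance(value, schema[key]):
            problems.append(
                f"{key}: expected {schema[key].__name__}, got {type(value).__name__}"
            )
    return problems


metadata = {"page_number": 15, "topic": "Machine Learning", "on_sale": False}
strict_problems = validate(metadata, SCHEMA, strict=True)
flexible_problems = validate(metadata, SCHEMA, strict=False)
```

In this sketch the undeclared `on_sale` field is reported only in strict mode, while a type mismatch on a declared field is reported in both modes.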
Regardless of the approach, understanding the structure of your vector records (the fixed vector dimensionality plus the available metadata fields and their types) is essential. This structure directly influences how you build indexes (covered in the next chapter) and, critically, how you query the database. The power of vector databases often comes from combining vector similarity search with precise filtering on these metadata attributes (e.g., "find documents semantically similar to this query, but only those published after 2022 and tagged with 'ANN'"). This dual capability is enabled by a data model that tightly couples the high-dimensional vector with its descriptive metadata.
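The combined query described above can be sketched with a brute-force search over a handful of records. This is an assumption-laden illustration rather than how a production system works: real vector databases use ANN indexes and may apply filters before, during, or after the vector search, whereas this sketch filters first and then ranks every remaining record by cosine similarity. The function names and sample records are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Assumes both vectors are non-zero and have the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(records, query, predicate, top_k=3):
    """Brute-force search: filter on metadata, then rank by cosine similarity."""
    candidates = [r for r in records if predicate(r["metadata"])]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up two-dimensional records, for illustration only.
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"published_year": 2023, "keywords": ["ANN"]}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"published_year": 2021, "keywords": ["ANN"]}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"published_year": 2024, "keywords": ["ANN", "HNSW"]}},
    {"id": "d", "vector": [0.8, 0.2], "metadata": {"published_year": 2024, "keywords": ["SQL"]}},
]

# "Similar to this query, but only published after 2022 and tagged with 'ANN'."
results = filtered_search(
    records,
    query=[1.0, 0.0],
    predicate=lambda m: m["published_year"] > 2022 and "ANN" in m["keywords"],
)
result_ids = [r["id"] for r in results]
print(result_ids)  # → ['a', 'c']
```

Records `b` and `d` are close to the query vector but are excluded by the metadata filter, which is exactly the effect the coupled data model makes possible.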
© 2025 ApX Machine Learning