While quantization techniques compress vectors to optimize storage and speed up distance calculations, hashing methods offer a fundamentally different approach to approximate nearest neighbor search. Instead of compressing the vector directly, techniques like Locality Sensitive Hashing (LSH) aim to group similar vectors into the same "buckets" using specialized hash functions. This section provides a refresher on binary hashing and LSH, exploring their principles and relevance in the context of optimizing vector search systems.
From Vectors to Bits: Binary Hashing
At its core, binary hashing seeks to convert high-dimensional floating-point vectors into compact binary codes, typically represented as bit strings. The primary motivation is efficiency: operations on binary codes, particularly calculating the Hamming distance (the number of differing bits), can be significantly faster than calculating Euclidean or cosine distances on the original vectors. Furthermore, binary codes require much less storage.
A simple, illustrative approach might involve defining a set of thresholds. For a vector v = [v1, v2, ..., vd], we could generate a binary hash h = [b1, b2, ..., bd] where bi = 1 if vi > thresholdi and bi = 0 otherwise. Another basic technique involves projecting the vector onto a set of random vectors and assigning a bit based on the sign of the projection.
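As a rough illustration, here is a minimal sketch of both ideas in NumPy (the zero thresholds, the number of output bits, and the random projection directions are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # vector dimensionality (example value)
v = rng.normal(size=d)      # a toy floating-point vector
w = rng.normal(size=d)      # a second vector to compare against

# Variant 1: one threshold per dimension (here simply zero everywhere).
thresholds = np.zeros(d)
bits_v_threshold = (v > thresholds).astype(np.uint8)

# Variant 2: sign of projections onto random directions (one direction per output bit).
projections = rng.normal(size=(d, d))
bits_v = (projections @ v >= 0).astype(np.uint8)
bits_w = (projections @ w >= 0).astype(np.uint8)

# Hamming distance: count the differing bits. In practice the codes are packed into
# machine words, so this becomes an XOR followed by a popcount.
hamming = int(np.count_nonzero(bits_v != bits_w))
print(bits_v, bits_w, hamming)
```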
However, simple binary hashing methods often struggle to preserve the locality of the original vectors effectively. Two vectors that are close in the original high-dimensional space might end up with vastly different binary codes, and vice-versa. This limitation led to the development of more sophisticated techniques specifically designed for similarity search.
Locality Sensitive Hashing (LSH)
Locality Sensitive Hashing (LSH) addresses the shortcomings of naive hashing by employing hash functions with a specific, mathematically defined property: the probability that two inputs collide depends on how similar they are. An LSH family of functions is designed such that:
- Vectors that are close to each other in the original vector space have a high probability of hashing to the same value (colliding).
- Vectors that are far apart in the original vector space have a low probability of hashing to the same value.
More formally, for a distance measure D, a family H of hash functions is called (r1, r2, p1, p2)-sensitive if for any two vectors x, y:
- If D(x, y) ≤ r1, then Pr[h(x) = h(y)] ≥ p1 for a randomly chosen h ∈ H
- If D(x, y) ≥ r2, then Pr[h(x) = h(y)] ≤ p2 for a randomly chosen h ∈ H
where r1 < r2 and p1 > p2. The goal is to maximize the gap between p1 and p2. For example, for angular distance the random-hyperplane family described below satisfies Pr[h(x) = h(y)] = 1 − θ(x, y)/π, so it is (π/4, π/2, 0.75, 0.5)-sensitive.
LSH Families for Common Similarity Metrics
Different LSH families are suited for different distance or similarity metrics:
- Random Projections (for Cosine Similarity/Angular Distance): This is perhaps the most relevant LSH family for the dense embeddings often used in LLM applications, which frequently rely on cosine similarity. The core idea is to generate several random hyperplanes passing through the origin. Each hyperplane divides the space into two halves. A vector's hash code (or a part of it) is determined by which side of these hyperplanes it falls on. For a single hyperplane defined by a random normal vector u, the hash function can be h(v) = 1 if u⋅v ≥ 0 and h(v) = 0 if u⋅v < 0. Vectors with a smaller angle between them (higher cosine similarity) are more likely to fall on the same side of a random hyperplane, thus having a higher probability of collision. Multiple such hash functions (using different random vectors u) are combined to create a longer hash code, increasing specificity. A short code sketch of this family follows the list below.
(Figure: vectors v1 and v2 are close (small angle) and fall on the same side of the random hyperplane, so h(v1) = h(v2) = 1 and a collision is likely; v3 is far from v1 and v2 and falls on the other side, so h(v3) = 0.)
- MinHash (for Jaccard Similarity): Primarily used for comparing sets (e.g., sets of words in documents). While less common for dense vector embeddings from LLMs, it's a classic LSH technique. It estimates Jaccard similarity based on the probability that random permutations of the items yield the same minimum element for both sets.
- p-Stable Distributions (for Lp distances, e.g., Euclidean): This family uses projections onto random lines, followed by quantization. The hash functions are constructed based on sampling from p-stable distributions (like Gaussian for L2/Euclidean distance). Vectors closer in Lp distance are more likely to map to nearby values after projection and quantization.
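To make the random-hyperplane family concrete, here is a minimal sketch (NumPy assumed; the dimensionality, number of hyperplanes, and perturbation size are arbitrary example values). Each of the k random normal vectors contributes one bit, and vectors separated by a small angle should disagree on few bits, since the probability that a single hyperplane separates two vectors is θ/π:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 64, 16                           # dimensionality and bits per code (example values)
hyperplanes = rng.normal(size=(k, d))   # row i is the random normal vector u_i

def lsh_code(v: np.ndarray) -> np.ndarray:
    """k-bit code: bit i is 1 if v falls on the non-negative side of hyperplane i."""
    return (hyperplanes @ v >= 0).astype(np.uint8)

base = rng.normal(size=d)
near = base + 0.1 * rng.normal(size=d)  # small perturbation -> small angle to `base`
far = rng.normal(size=d)                # unrelated vector -> roughly 90 degrees on average

# The expected fraction of differing bits is angle / pi, so `near` should
# disagree with `base` on far fewer bits than `far` does.
print("bits differing (near):", int(np.count_nonzero(lsh_code(base) != lsh_code(near))))
print("bits differing (far): ", int(np.count_nonzero(lsh_code(base) != lsh_code(far))))
```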
Using LSH for Approximate Nearest Neighbor Search
Implementing ANN search with LSH typically involves these steps:
1. Indexing:
   - Choose an appropriate LSH family based on the distance metric (e.g., Random Projections for cosine similarity).
   - Generate multiple hash tables. Each table uses a different set of hash functions (e.g., by concatenating k basic LSH functions like the random hyperplane test).
   - Hash every vector in the database using the functions for each table and store the vector (or its ID) in the corresponding bucket(s).
2. Querying:
   - Hash the query vector using the same set of hash functions for each table.
   - Identify the bucket(s) the query vector falls into in each hash table.
   - Collect all unique vectors (candidates) present in these buckets across all tables.
3. Refinement:
   - Calculate the exact distance (e.g., cosine similarity) between the query vector and each candidate vector retrieved in the previous step.
   - Return the top-k candidates with the best distances.
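Putting the three steps together, here is a minimal end-to-end sketch for cosine similarity (NumPy assumed; the class name, parameter values, and random dataset are illustrative choices rather than a reference implementation):

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy multi-table LSH index for cosine similarity (illustration only)."""

    def __init__(self, dim: int, num_tables: int = 8, bits_per_table: int = 12, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One independent set of random hyperplanes per table.
        self.planes = [rng.normal(size=(bits_per_table, dim)) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.vectors = None

    @staticmethod
    def _key(planes: np.ndarray, v: np.ndarray) -> bytes:
        # Bucket key: the sign pattern of v across this table's hyperplanes.
        return (planes @ v >= 0).tobytes()

    def index(self, vectors: np.ndarray) -> None:
        self.vectors = vectors
        for i, v in enumerate(vectors):
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, v)].append(i)

    def query(self, q: np.ndarray, top_k: int = 5):
        # Step 1: collect candidate IDs from the matching bucket of every table.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, q), []))
        if not candidates:
            return []
        # Step 2: refine with exact cosine similarity on the candidates only.
        ids = np.fromiter(candidates, dtype=int)
        cand = self.vectors[ids]
        sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q) + 1e-12)
        order = np.argsort(-sims)[:top_k]
        return [(int(ids[i]), float(sims[i])) for i in order]

# Usage: index 10,000 random vectors, then query with a noisy copy of vector 42.
data = np.random.default_rng(2).normal(size=(10_000, 64))
index = RandomHyperplaneLSH(dim=64)
index.index(data)
print(index.query(data[42] + 0.05 * np.random.default_rng(3).normal(size=64)))
```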
Configuration and Trade-offs
The performance of LSH depends heavily on its parameters:
- Number of Hash Tables (L): Increasing L improves the probability of finding near neighbors (higher recall) because a neighbor missed in one table might be caught in another. However, it also increases query time as more buckets need to be checked.
- Number of Hash Functions per Table (k): Increasing k makes the hash codes longer and more discriminative. This reduces the average size of buckets (fewer false positives per bucket) but also decreases the collision probability for true near neighbors (potentially lower recall if L is not increased accordingly). It generally requires careful tuning.
Finding the optimal values for L and k is essential for balancing recall, query latency, and memory usage.
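One way to reason about this balance: with the construction described above, where a table's bucket key concatenates k hash functions and a candidate only needs to match the query in one of the L tables, a vector whose per-function collision probability with the query is p becomes a candidate with probability 1 − (1 − p^k)^L. A quick numeric sketch (the p values and parameter pairs are arbitrary examples):

```python
# Probability that a vector lands in the query's bucket in at least one of L tables,
# given per-function collision probability p and k concatenated functions per table.
def retrieval_probability(p: float, k: int, L: int) -> float:
    return 1 - (1 - p**k) ** L

for k, L in [(8, 4), (8, 16), (16, 16), (16, 64)]:
    near = retrieval_probability(0.90, k, L)   # a genuinely close vector
    far = retrieval_probability(0.50, k, L)    # a distant vector
    print(f"k={k:2d}, L={L:2d} -> close: {near:.3f}, distant: {far:.3f}")
```

The numbers show the tension directly: raising k sharply suppresses distant vectors, but the chance of retrieving true neighbors also drops unless L grows with it.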
LSH in the Context of Modern Vector Search
While LSH provides a probabilistic foundation for ANN search and was a significant early development, for the dense vectors produced by modern embedding models it now faces strong competition from graph-based and quantization-based indexes.
- Comparison to HNSW/IVF: Graph-based methods like HNSW often achieve higher recall for a given query speed compared to LSH on typical dense vector benchmarks. IVF indexes combined with Product Quantization (IVFPQ) usually offer better memory efficiency for storing compressed vectors compared to the potentially large hash tables needed by LSH for high recall.
- Tuning Complexity: Tuning LSH parameters (L, k, and potentially parameters within the hash function itself) can sometimes be less intuitive than tuning HNSW (e.g., efConstruction, efSearch) or IVF (nprobe) parameters.
- Strengths: LSH can still be advantageous in certain scenarios. It might be easier to distribute or update compared to complex graph structures. Its performance can sometimes be less sensitive to the intrinsic dimensionality of the data compared to some tree-based methods. Some specialized database systems leverage LSH variants.
Understanding LSH remains valuable. It provides theoretical insights into the challenges of high-dimensional search and represents a distinct algorithmic approach compared to partitioning/graph methods (IVF, HNSW) and vector compression (Quantization). While you might more frequently encounter HNSW or quantization in state-of-the-art LLM applications, LSH is part of the toolkit for optimizing vector search, especially when memory or specific performance trade-offs are paramount, or when dealing with non-standard distance metrics where other methods might not apply as easily. It serves as a conceptual bridge to understanding why approximation is necessary and how different strategies tackle the "curse of dimensionality".