Continuing our exploration of Approximate Nearest Neighbor (ANN) algorithms, we now turn to Locality-Sensitive Hashing (LSH). Unlike the graph-based exploration of HNSW or the cluster-centric approach of IVF, LSH utilizes a family of specialized hash functions to group similar vectors together probabilistically. The fundamental idea is elegant: if two vectors are close in their high-dimensional space, a well-chosen hash function should assign them the same hash code (or "bucket") with a higher probability than it would for two distant vectors.
The Core Idea: Hashing Similar Items Together
Imagine you have a massive library, and instead of a perfect catalog, you have several approximate categorization systems. One system might group books by the first letter of the author's last name, another by the primary color of the cover, and yet another by the decade of publication. None of these is perfect for finding books on exactly the same topic, but if you query all systems for books matching your target book's characteristics (author initial 'T', blue cover, published in the 1990s), you significantly narrow down your search space.
LSH works similarly. It doesn't guarantee that the absolute nearest neighbor will be in the same hash bucket as the query vector, but it makes it highly probable. To increase this probability, LSH typically employs multiple hash tables, each constructed with different hash functions drawn from the same LSH family.
How LSH Works
- Hashing: Select an appropriate LSH family based on the distance metric used (e.g., Random Projection for Cosine Similarity or Euclidean Distance, MinHash for Jaccard Similarity). Generate multiple hash functions from this family. For each vector in the dataset, compute its hash code using these functions. Often, multiple hash functions are combined to create a single hash signature for each hash table.
- Indexing: Store each vector in a hash table, placing it into the bucket corresponding to its computed hash code. Because collisions (multiple vectors landing in the same bucket) are expected and desired for similar items, buckets often store lists of vector identifiers. As mentioned, multiple independent hash tables are usually created using different sets of hash functions to improve the chances of finding true neighbors.
- Querying: When a query vector arrives, compute its hash code(s) using the same hash functions used for indexing. Retrieve all vectors stored in the corresponding bucket(s) across all hash tables. This collection of vectors forms the candidate set.
- Refinement: Since hashing is approximate, the candidate set may contain vectors that are not true near neighbors (false positives). Perform exact distance calculations (e.g., Cosine Similarity, Euclidean Distance) between the query vector and all vectors in the candidate set. Rank these candidates by their true distance and return the top-k results.
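To make these four steps concrete, here is a minimal sketch of a random-projection LSH index built with NumPy, assuming cosine-style similarity. The parameter names (NUM_TABLES, NUM_HASHES) and helper functions are illustrative choices for this sketch, not part of any particular library.

```python
# Minimal random-projection LSH sketch (cosine-style similarity).
# Parameter names and helpers are illustrative, not from any library.
import numpy as np

rng = np.random.default_rng(42)

DIM = 128          # vector dimensionality
NUM_TABLES = 8     # L: number of independent hash tables
NUM_HASHES = 12    # k: hyperplanes per table -> 12-bit signature

# Hashing: one set of random hyperplanes per table.
hyperplanes = rng.normal(size=(NUM_TABLES, NUM_HASHES, DIM))

def signature(vec, planes):
    """Bit signature: which side of each hyperplane the vector falls on."""
    bits = (planes @ vec) > 0
    return bits.tobytes()  # hashable bucket key

# Indexing: place every vector id into one bucket per table.
def build_index(vectors):
    tables = [dict() for _ in range(NUM_TABLES)]
    for idx, vec in enumerate(vectors):
        for t in range(NUM_TABLES):
            key = signature(vec, hyperplanes[t])
            tables[t].setdefault(key, []).append(idx)
    return tables

# Querying + refinement: union the matching buckets, then re-rank
# candidates by exact cosine similarity.
def query(q, vectors, tables, top_k=5):
    candidates = set()
    for t in range(NUM_TABLES):
        candidates.update(tables[t].get(signature(q, hyperplanes[t]), []))
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    ranked = sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:top_k]

vectors = rng.normal(size=(10_000, DIM))
tables = build_index(vectors)
print(query(vectors[0] + 0.01 * rng.normal(size=DIM), vectors, tables))
```

Note how the refinement step re-ranks only the candidates that collided with the query in some table, which is what keeps the overall search cheaper than a brute-force scan.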
Similar input vectors (such as the pairs V1, V2 and V3, V4) are likely to collide into the same hash buckets. A query Q then hashes to one or more of those buckets, retrieving the candidates (V1, V2, V3, V4) for exact comparison.
LSH Families
The effectiveness of LSH depends heavily on choosing an LSH family suitable for the distance metric being used. Some common examples include:
- Random Projection: This family is often used for approximating Cosine Similarity or Euclidean Distance. The core idea involves generating random hyperplanes (defined by random vectors). A vector's hash is determined by which side of these hyperplanes it falls on. If two vectors are close (small angle between them for Cosine Similarity, or small distance for Euclidean), they are likely to fall on the same side of most random hyperplanes, leading to similar hash codes.
- MinHash: Primarily designed for Jaccard Similarity, which measures the similarity between sets (often used for text documents represented as sets of shingles or words). MinHash uses random permutations of the items and hashes sets based on the minimum item ID observed after a permutation. Sets with high Jaccard similarity are more likely to produce the same MinHash values.
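For contrast with the random-projection sketch above, here is a small MinHash sketch for Jaccard similarity over token sets. It uses salted CRC32 hashes as a stand-in for true random permutations, which is a common practical shortcut; the names and parameters are illustrative.

```python
# Small MinHash sketch for Jaccard similarity over token sets.
# Salted CRC32 hashes approximate random permutations; names are illustrative.
import random
import zlib

NUM_PERMS = 64
random.seed(0)
SALTS = [random.getrandbits(32) for _ in range(NUM_PERMS)]

def minhash_signature(tokens):
    """For each 'permutation' (salt), keep the minimum hash value over the set."""
    return [min(zlib.crc32(tok.encode(), salt) for tok in tokens) for salt in SALTS]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / NUM_PERMS

doc_a = {"the", "quick", "brown", "fox"}
doc_b = {"the", "quick", "brown", "dog"}
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

With true random permutations, each signature position matches with probability equal to the Jaccard similarity of the two sets, so the fraction of matching positions is an unbiased estimate; longer signatures give more precise estimates.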
Parameters and Trade-offs
Tuning LSH involves balancing the probability of finding neighbors against computational cost. Important parameters include:
- Number of Hash Tables (L): Using more hash tables increases the likelihood that the query vector and its true neighbors will collide in at least one table, thus improving recall. However, it also increases memory usage and query time, as more buckets need to be checked.
- Number of Hash Functions per Table (k): Using more hash functions per table creates more specific hash signatures (longer hash codes). This reduces the average size of buckets, speeding up the candidate refinement step, but it also decreases the probability of collision for any single table. Finding the right balance between L and k is essential for optimizing the recall-speed trade-off.
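As a rough illustration of this trade-off, the snippet below computes the probability that a query and another vector share a bucket in at least one table. It assumes the random-hyperplane family, where a single hash collides with probability p = 1 - θ/π for vectors at angle θ, so the chance of at least one collision across L tables is 1 - (1 - p^k)^L. The specific k, L, and angle values are illustrative.

```python
# How k (hashes per table) and L (tables) shape retrieval probability,
# assuming random-hyperplane hashing: a single hash collides with
# probability p = 1 - theta/pi for vectors at angle theta.
import math

def retrieval_probability(theta, k, L):
    p = 1 - theta / math.pi          # single-hyperplane collision probability
    return 1 - (1 - p ** k) ** L     # collide in at least one of L tables

for k, L in [(8, 4), (16, 4), (16, 32)]:
    near = retrieval_probability(math.radians(15), k, L)   # a close neighbor
    far = retrieval_probability(math.radians(75), k, L)    # a distant vector
    print(f"k={k:2d} L={L:2d}  near={near:.3f}  far={far:.3f}")
```

Increasing k sharply suppresses collisions with distant vectors (smaller, cleaner buckets) but also lowers the per-table chance of catching true neighbors, which additional tables (larger L) then have to recover.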
Compared to methods like HNSW, LSH can sometimes be more challenging to tune for optimal performance across different datasets. Its performance is sensitive to the choice of LSH family and parameters (L, k). While LSH is theoretically appealing, graph-based or quantization-based methods often achieve better empirical results for many common vector search tasks, particularly when high recall is required at low latency. However, LSH remains a foundational ANN technique, especially relevant in specific domains like duplicate detection or when dealing with certain types of data, such as sets handled via MinHash.
Understanding LSH provides valuable insight into the probabilistic nature of approximate search and the diverse strategies available for tackling the nearest neighbor problem in high-dimensional spaces.