Now that we've covered the foundations of hash tables, feature hashing, and Locality-Sensitive Hashing (LSH), let's put these techniques into practice. This section provides hands-on exercises using Python to solidify your understanding of how hashing is applied in machine learning tasks, particularly for handling large or high-dimensional data efficiently.
First, let's revisit the core idea of a hash table and collision handling. While Python's built-in dict handles this internally, manually implementing a simple version helps illustrate the mechanics. We'll use separate chaining, where each bucket in the hash table holds a list of items that hash to the same index.
import pprint # For pretty printing
class SimpleHashTable:
    def __init__(self, size=10):
        self.size = size
        # Initialize table with empty lists for chaining
        self.table = [[] for _ in range(self.size)]

    def _hash_function(self, key):
        # A simple modulo hash function
        return hash(key) % self.size

    def insert(self, key, value):
        hash_index = self._hash_function(key)
        bucket = self.table[hash_index]
        # Check if key already exists in the bucket
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                # Update existing key's value
                bucket[i] = (key, value)
                return
        # Key not found, append new key-value pair
        bucket.append((key, value))

    def search(self, key):
        hash_index = self._hash_function(key)
        bucket = self.table[hash_index]
        # Search for the key within the bucket's list
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        # Key not found
        return None
# Example Usage
ht = SimpleHashTable(size=5)
ht.insert("apple", 1)
ht.insert("banana", 2)
ht.insert("cherry", 3)
ht.insert("date", 4) # Potential collision depending on hash() output
ht.insert("elderberry", 5) # Potential collision
ht.insert("fig", 6) # Potential collision
print("Hash Table Structure:")
pprint.pprint(ht.table)
print("\nSearch Results:")
print(f"Search 'banana': {ht.search('banana')}")
print(f"Search 'grape': {ht.search('grape')}")
print(f"Search 'date': {ht.search('date')}")
# Demonstrate update
ht.insert("apple", 100)
print(f"\nSearch 'apple' after update: {ht.search('apple')}")
print("\nHash Table Structure after update:")
pprint.pprint(ht.table)
Experiment with different size values and input keys. Notice how collisions (multiple items in the same inner list) occur and how separate chaining handles them. When collisions are frequent, the lookup time degrades from the ideal O(1) towards O(n) in the worst case (all keys hashing to the same bucket), highlighting the importance of a good hash function and appropriate table sizing.
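As a minimal sketch of that experiment (the key list and table sizes below are arbitrary, illustrative choices), you can insert the same keys into tables of different sizes and count how many buckets end up holding more than one entry:

# Count collisions for the same keys across different table sizes.
# Note: Python salts hash() for strings per process, so exact counts vary between runs.
sample_keys = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]
for table_size in [3, 5, 11, 31]:
    ht_demo = SimpleHashTable(size=table_size)
    for i, key in enumerate(sample_keys):
        ht_demo.insert(key, i)
    collided_buckets = sum(1 for bucket in ht_demo.table if len(bucket) > 1)
    longest_chain = max(len(bucket) for bucket in ht_demo.table)
    print(f"size={table_size:2d}: buckets with collisions = {collided_buckets}, longest chain = {longest_chain}")

Longer chains mean more comparisons per lookup, which is exactly the O(1)-to-O(n) degradation described above.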
Feature hashing is particularly useful for converting high-dimensional, sparse text data into numerical feature vectors of a fixed size. Scikit-learn provides the HashingVectorizer for this purpose. Let's compare it to the standard CountVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
import numpy as np
# Sample text data (e.g., from user reviews or documents)
corpus = [
    'this is the first document great document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document again maybe',
]
# 1. Using CountVectorizer (creates a vocabulary)
count_vectorizer = CountVectorizer()
count_features = count_vectorizer.fit_transform(corpus)
print("--- CountVectorizer ---")
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
# print(f"Vocabulary: {count_vectorizer.vocabulary_}") # Can be large!
print(f"Feature matrix shape: {count_features.shape}")
# print(f"Feature matrix (sparse):\n{count_features.toarray()}") # Dense representation
# 2. Using HashingVectorizer (fixed number of features)
# Choose n_features (the dimensionality of the output space)
# A smaller n_features increases collision probability but saves memory.
n_features_hashed = 10 # Significantly smaller than the vocabulary size
hashing_vectorizer = HashingVectorizer(n_features=n_features_hashed, norm=None, alternate_sign=False)
hashed_features = hashing_vectorizer.fit_transform(corpus)
print("\n--- HashingVectorizer ---")
print(f"Number of features (fixed): {n_features_hashed}")
print(f"Feature matrix shape: {hashed_features.shape}")
# Convert to dense array to inspect. Note potential collisions.
print(f"Feature matrix (sparse -> dense):\n{hashed_features.toarray()}")
# Notice values > 1 can occur due to collisions where different words hash to the same feature index.
# The `alternate_sign=True` (default) option helps mitigate collision impact by randomly assigning +1 or -1.
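Before turning to the observations, here is a rough way to see the n_features trade-off in action: compare how many distinct columns the hashed representation actually uses against the true vocabulary size. This is a sketch that reuses the corpus and count_vectorizer defined above; the candidate dimensionalities are arbitrary.

# Any gap between the vocabulary size and the number of columns used means that
# at least that many terms collided into shared columns.
vocab_size = len(count_vectorizer.vocabulary_)
for n in [5, 10, 50, 1000]:
    vec = HashingVectorizer(n_features=n, norm=None, alternate_sign=False)
    X = vec.transform(corpus)
    used_columns = int((np.asarray(X.sum(axis=0)).ravel() > 0).sum())
    print(f"n_features={n:5d}: columns used = {used_columns}, vocabulary size = {vocab_size}")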
Observations:

- HashingVectorizer produces a matrix with a predefined number of columns (n_features), regardless of the vocabulary size. This controls memory usage directly.
- The hashed_features.toarray() output might show values greater than 1, or negative values if alternate_sign=True, in positions corresponding to collisions. This makes the resulting features less interpretable than those from CountVectorizer.
- Choosing n_features involves a trade-off: smaller values save more memory but increase collisions, potentially hurting model performance; larger values reduce collisions but use more memory.

This exercise demonstrates how feature hashing efficiently encodes sparse data, a common scenario in Natural Language Processing (NLP) and recommendation systems.
Implementing a full LSH index is involved, but we can use a library like datasketch to illustrate the core idea using MinHash, which is suitable for estimating Jaccard similarity between sets.
First, ensure you have datasketch installed:
pip install datasketch
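Before building the full LSH index, a quick sketch of the underlying idea: the MinHash signatures of two sets can be compared directly, and the estimate should land close to their exact Jaccard similarity. The two example sets here are arbitrary.

from datasketch import MinHash

a = {"cat", "dog", "mouse", "bird", "fish"}
b = {"cat", "dog", "mouse", "lion", "tiger"}

m_a, m_b = MinHash(num_perm=128), MinHash(num_perm=128)
for item in a:
    m_a.update(item.encode('utf8'))
for item in b:
    m_b.update(item.encode('utf8'))

# Exact Jaccard similarity is |A intersect B| / |A union B|
exact = len(a & b) / len(a | b)
print(f"Exact Jaccard: {exact:.3f}, MinHash estimate: {m_a.jaccard(m_b):.3f}")

More permutations tighten the estimate at the cost of extra computation and storage.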
Now, let's create MinHash signatures for sample sets and use an LSH index for approximate nearest neighbor search.
from datasketch import MinHash, MinHashLSH
# Sample sets representing documents or user preferences
set1 = set(["cat", "dog", "mouse", "bird", "fish"])
set2 = set(["cat", "dog", "mouse", "lion", "tiger"]) # Similar to set1
set3 = set(["apple", "banana", "orange", "grape", "kiwi"]) # Dissimilar
set4 = set(["cat", "dog", "fish", "hamster", "parrot"]) # Moderately similar to set1
# Data to index
data = {
    "set1": set1,
    "set2": set2,
    "set3": set3,
    "set4": set4
}
# Parameters for MinHash
# Higher num_perm means better accuracy but more computation/storage
num_perm = 128
# Create MinHash objects for each set
minhashes = {}
for key, item_set in data.items():
    m = MinHash(num_perm=num_perm)
    for element in item_set:
        m.update(element.encode('utf8'))
    minhashes[key] = m
    # print(f"MinHash signature for {key}: {minhashes[key].digest()}")  # Inspect signature if needed
# Optional: Calculate exact Jaccard similarity for comparison later
# Create an LSH index
# Threshold defines the Jaccard similarity threshold for candidate pairs
lsh = MinHashLSH(threshold=0.5, num_perm=num_perm)
# Index the MinHash objects
print("\n--- Indexing Sets in LSH ---")
for key, m_hash in minhashes.items():
    lsh.insert(key, m_hash)
    print(f"Indexed '{key}'")
# Define a query set and create its MinHash
query_set = set(["cat", "dog", "mouse", "rabbit", "horse"])
query_key = "query"
m_query = MinHash(num_perm=num_perm)
for element in query_set:
    m_query.update(element.encode('utf8'))
# Query the LSH index for approximate nearest neighbors
print(f"\n--- Querying LSH for sets similar to '{query_key}' (threshold > 0.5) ---")
result = lsh.query(m_query)
print(f"Found approximate neighbors: {result}")
# Verification (optional): Calculate exact Jaccard similarity
print("\n--- Verification (Exact Jaccard Similarity) ---")
print(f"Query Set: {query_set}")
for key in result:
    intersect_size = len(query_set.intersection(data[key]))
    union_size = len(query_set.union(data[key]))
    jaccard = intersect_size / union_size if union_size > 0 else 0
    print(f" Jaccard similarity with '{key}' ({data[key]}): {jaccard:.4f}")
# Compare with a dissimilar set not found by LSH query
intersect_size_set3 = len(query_set.intersection(data["set3"]))
union_size_set3 = len(query_set.union(data["set3"]))
jaccard_set3 = intersect_size_set3 / union_size_set3 if union_size_set3 > 0 else 0
print(f" Jaccard similarity with 'set3' ({data['set3']}): {jaccard_set3:.4f}")
Observations:

- The query results (the result list) contain keys of sets whose estimated Jaccard similarity with the query set is above the specified threshold. This is an approximate search; it might miss some neighbors (false negatives) or include items slightly below the threshold (false positives), but it's significantly faster than exact comparison for large datasets.
- Adjusting num_perm and threshold allows you to control the trade-off between accuracy and query speed.

These practical exercises demonstrate how hashing techniques are implemented to address specific challenges in machine learning, from efficient feature representation to scalable similarity search. Understanding these implementations helps in choosing the right tools and tuning their parameters effectively within your ML pipelines.