Hands-on exercises using Python demonstrate the practical application of hash tables, feature hashing, and Locality-Sensitive Hashing (LSH). These exercises solidify understanding of how hashing is applied in machine learning tasks, particularly for handling large or high-dimensional data efficiently.

## Implementing Basic Hashing and Collision Handling

First, let's revisit the core idea of a hash table and collision handling. While Python's built-in `dict` handles this internally, manually implementing a simple version helps illustrate the mechanics. We'll use separate chaining, where each bucket in the hash table holds a list of items that hash to the same index.

```python
import pprint  # For pretty printing

class SimpleHashTable:
    def __init__(self, size=10):
        self.size = size
        # Initialize table with empty lists for chaining
        self.table = [[] for _ in range(self.size)]

    def _hash_function(self, key):
        # A simple modulo hash function
        return hash(key) % self.size

    def insert(self, key, value):
        hash_index = self._hash_function(key)
        bucket = self.table[hash_index]
        # Check if key already exists in the bucket
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                # Update existing key's value
                bucket[i] = (key, value)
                return
        # Key not found, append new key-value pair
        bucket.append((key, value))

    def search(self, key):
        hash_index = self._hash_function(key)
        bucket = self.table[hash_index]
        # Search for the key within the bucket's list
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        # Key not found
        return None

# Example Usage
ht = SimpleHashTable(size=5)
ht.insert("apple", 1)
ht.insert("banana", 2)
ht.insert("cherry", 3)
ht.insert("date", 4)        # Potential collision depending on hash() output
ht.insert("elderberry", 5)  # Potential collision
ht.insert("fig", 6)         # Potential collision

print("Hash Table Structure:")
pprint.pprint(ht.table)

print("\nSearch Results:")
print(f"Search 'banana': {ht.search('banana')}")
print(f"Search 'grape': {ht.search('grape')}")
print(f"Search 'date': {ht.search('date')}")

# Demonstrate update
ht.insert("apple", 100)
print(f"\nSearch 'apple' after update: {ht.search('apple')}")
print("\nHash Table Structure after update:")
pprint.pprint(ht.table)
```

Experiment with different `size` values and input keys. Notice how collisions (multiple items in the same inner list) occur and how separate chaining handles them. When collisions are frequent, lookup time degrades from the ideal $O(1)$ towards $O(n)$ in the worst case (all keys hashing to the same bucket), highlighting the importance of a good hash function and appropriate table sizing.
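You can observe this degradation directly by measuring how long the chains get as the table shrinks. The following is a minimal sketch, assuming the `SimpleHashTable` class defined above (the `longest_chain` helper is introduced here only for illustration):

```python
def longest_chain(table_size, keys):
    """Insert keys into a SimpleHashTable and return the longest bucket length."""
    ht = SimpleHashTable(size=table_size)
    for key in keys:
        ht.insert(key, 0)
    return max(len(bucket) for bucket in ht.table)

keys = [f"key_{i}" for i in range(100)]
for size in [5, 25, 100, 500]:
    # The load factor (average items per bucket) drives chain length; exact
    # numbers vary between runs because Python randomizes string hashing.
    print(f"size={size:4d}  load factor={len(keys) / size:6.2f}  "
          f"longest chain={longest_chain(size, keys)}")
```

The longest chain bounds the number of comparisons a single lookup can require, so keeping the load factor low keeps lookups close to $O(1)$.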
## Applying Feature Hashing with Scikit-learn

Feature hashing is particularly useful for converting high-dimensional, sparse text data into numerical feature vectors of a fixed size. Scikit-learn provides the `HashingVectorizer` for this purpose. Let's compare it to the standard `CountVectorizer`.

```python
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

# Sample text data (e.g., from user reviews or documents)
corpus = [
    'this is the first document great document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document again maybe',
]

# 1. Using CountVectorizer (creates a vocabulary)
count_vectorizer = CountVectorizer()
count_features = count_vectorizer.fit_transform(corpus)
print("--- CountVectorizer ---")
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
# print(f"Vocabulary: {count_vectorizer.vocabulary_}")  # Can be large!
print(f"Feature matrix shape: {count_features.shape}")
# print(f"Feature matrix (dense):\n{count_features.toarray()}")

# 2. Using HashingVectorizer (fixed number of features)
# Choose n_features (the dimensionality of the output space).
# A smaller n_features increases collision probability but saves memory.
n_features_hashed = 10  # Significantly smaller than the vocabulary size
hashing_vectorizer = HashingVectorizer(n_features=n_features_hashed,
                                       norm=None, alternate_sign=False)
hashed_features = hashing_vectorizer.fit_transform(corpus)
print("\n--- HashingVectorizer ---")
print(f"Number of features (fixed): {n_features_hashed}")
print(f"Feature matrix shape: {hashed_features.shape}")
# Convert to a dense array to inspect. Note potential collisions:
# values > 1 can occur where different words hash to the same feature index.
# The alternate_sign=True (default) option helps mitigate collision impact
# by randomly assigning +1 or -1 to each token's contribution.
print(f"Feature matrix (sparse -> dense):\n{hashed_features.toarray()}")
```

**Observations:**

- The `HashingVectorizer` produces a matrix with a predefined number of columns (`n_features`), regardless of the vocabulary size. This controls memory usage directly.
- It processes the text in a streaming fashion without building an explicit vocabulary, making it suitable for very large datasets (see the sketch after this list).
- The downside is potential hash collisions: different words might map to the same feature index. The `hashed_features.toarray()` output might show values greater than 1, or negative values if `alternate_sign=True`, in positions affected by collisions. This makes the resulting features less interpretable than those from `CountVectorizer`.
- The choice of `n_features` involves a trade-off: smaller values save more memory but increase collisions, potentially hurting model performance; larger values reduce collisions but use more memory.
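Because the hashing trick is stateless, there is no vocabulary to fit, and batches can be transformed independently. Below is a minimal sketch of this streaming usage; the `document_batches` generator is a hypothetical stand-in for a corpus too large to hold in memory:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)

def document_batches():
    # Hypothetical stand-in for reading chunks from disk or a network stream
    yield ['first batch of documents', 'another document in the first batch']
    yield ['a later batch arriving long after the first']

# No fit step is needed: every batch maps into the same 1024-dimensional
# space, so each one can be transformed and consumed as it arrives.
for batch in document_batches():
    X = vectorizer.transform(batch)
    print(X.shape)  # (len(batch), 1024) for every batch
```

This pattern pairs naturally with estimators that learn incrementally via `partial_fit`, such as `SGDClassifier`.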
This exercise demonstrates how feature hashing efficiently encodes sparse data, a common scenario in Natural Language Processing (NLP) and recommendation systems.

## Exploring Approximate Similarity Search with LSH (MinHash Example)

Implementing a full LSH index is involved, but we can use a library like `datasketch` to illustrate the core idea using MinHash, which is suitable for estimating Jaccard similarity between sets. First, ensure you have `datasketch` installed:

```bash
pip install datasketch
```

Now, let's create MinHash signatures for sample sets and use an LSH index for approximate nearest neighbor search.

```python
from datasketch import MinHash, MinHashLSH

# Sample sets representing documents or user preferences
set1 = set(["cat", "dog", "mouse", "bird", "fish"])
set2 = set(["cat", "dog", "mouse", "lion", "tiger"])        # Similar to set1
set3 = set(["apple", "banana", "orange", "grape", "kiwi"])  # Dissimilar
set4 = set(["cat", "dog", "fish", "hamster", "parrot"])     # Moderately similar to set1

# Data to index
data = {"set1": set1, "set2": set2, "set3": set3, "set4": set4}

# Parameters for MinHash
# Higher num_perm means better accuracy but more computation/storage
num_perm = 128

# Create MinHash objects for each set
minhashes = {}
for key, item_set in data.items():
    m = MinHash(num_perm=num_perm)
    for element in item_set:
        m.update(element.encode('utf8'))
    minhashes[key] = m
    # print(f"MinHash signature for {key}: {m.digest()}")  # Inspect signature if needed

# Create an LSH index
# threshold defines the Jaccard similarity threshold for candidate pairs
lsh = MinHashLSH(threshold=0.5, num_perm=num_perm)

# Index the MinHash objects
print("--- Indexing Sets in LSH ---")
for key, m_hash in minhashes.items():
    lsh.insert(key, m_hash)
    print(f"Indexed '{key}'")

# Define a query set and create its MinHash
query_set = set(["cat", "dog", "mouse", "rabbit", "horse"])
query_key = "query"
m_query = MinHash(num_perm=num_perm)
for element in query_set:
    m_query.update(element.encode('utf8'))

# Query the LSH index for approximate nearest neighbors
print(f"\n--- Querying LSH for sets similar to '{query_key}' (threshold > 0.5) ---")
result = lsh.query(m_query)
print(f"Found approximate neighbors: {result}")

# Verification (optional): calculate exact Jaccard similarity for comparison
print("\n--- Verification (Exact Jaccard Similarity) ---")
print(f"Query Set: {query_set}")
for key in result:
    intersect_size = len(query_set.intersection(data[key]))
    union_size = len(query_set.union(data[key]))
    jaccard = intersect_size / union_size if union_size > 0 else 0
    print(f"  Jaccard similarity with '{key}' ({data[key]}): {jaccard:.4f}")

# Compare with a dissimilar set not found by the LSH query
intersect_size_set3 = len(query_set.intersection(data["set3"]))
union_size_set3 = len(query_set.union(data["set3"]))
jaccard_set3 = intersect_size_set3 / union_size_set3 if union_size_set3 > 0 else 0
print(f"  Jaccard similarity with 'set3' ({data['set3']}): {jaccard_set3:.4f}")
```
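Before reading the observations, it is worth checking how well MinHash itself approximates Jaccard similarity, and how `num_perm` affects that estimate. `datasketch` exposes the signature-based estimate through `MinHash.jaccard()`; here is a minimal sketch reusing `set1` and `set2` from above:

```python
from datasketch import MinHash

# Exact Jaccard: |{cat, dog, mouse}| / |union of 7 animals| = 3/7
exact = len(set1 & set2) / len(set1 | set2)
print(f"Exact Jaccard(set1, set2): {exact:.4f}")

for n in [16, 64, 256, 1024]:
    m1, m2 = MinHash(num_perm=n), MinHash(num_perm=n)
    for e in set1:
        m1.update(e.encode('utf8'))
    for e in set2:
        m2.update(e.encode('utf8'))
    # The estimate is the fraction of signature positions where m1 and m2 agree
    print(f"num_perm={n:5d}  estimated Jaccard: {m1.jaccard(m2):.4f}")
```

Larger signatures give estimates that concentrate more tightly around the exact value, at the cost of extra computation and storage per set.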
**Observations:**

- MinHash converts sets into fixed-size signatures (arrays of integers) such that the probability of two signatures matching in a given position is related to the Jaccard similarity of the original sets.
- The LSH index groups similar MinHash signatures into buckets.
- Querying involves hashing the query signature and checking only the items within the corresponding buckets, avoiding a full comparison against all indexed items.
- The results (the `result` list) contain the keys of sets whose estimated Jaccard similarity with the query set is above the specified threshold. This is an approximate search: it might miss some neighbors (false negatives) or include items slightly below the threshold (false positives), but it is significantly faster than exact comparison for large datasets.
- Adjusting `num_perm` and `threshold` allows you to control the trade-off between accuracy and query speed; the sketch at the end of this section makes the `threshold` side of this trade-off concrete.

These practical exercises demonstrate how hashing techniques are implemented to address specific challenges in machine learning, from efficient feature representation to scalable similarity search. Understanding these implementations helps in choosing the right tools and tuning their parameters effectively within your ML pipelines.
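As a final experiment, here is a minimal sketch of the `threshold` effect, reusing `minhashes`, `m_query`, and `num_perm` from the example above. Rebuilding the index at several thresholds shows how the candidate set widens or narrows:

```python
from datasketch import MinHashLSH

# Lower thresholds admit looser matches; higher thresholds keep only
# near-duplicates. num_perm must match the signatures being indexed.
for threshold in [0.2, 0.5, 0.8]:
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for key, m_hash in minhashes.items():
        lsh.insert(key, m_hash)
    print(f"threshold={threshold}: neighbors = {lsh.query(m_query)}")
```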