Python provides highly efficient built-in implementations of hash tables through its dictionary (dict) and set (set) types. Understanding how these work and how to leverage them is fundamental for many machine learning tasks where fast lookups, insertions, and membership testing are needed.
Dictionaries (dict)
At its core, a Python dictionary is a hash map. It stores key-value pairs, allowing you to retrieve the value associated with a specific key very quickly, typically with average time complexity of O(1).
Keys in a dictionary must be hashable, meaning they must have a hash value that doesn't change during their lifetime (associated with the __hash__ method) and they can be compared to other objects (associated with the __eq__ method). Most built-in immutable types, such as strings, numbers (int, float), and tuples containing only hashable types, are hashable. Mutable types like lists and dictionaries cannot be used as keys because their contents (and thus their hash value) could change.
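For example, a tuple of coordinates works as a dictionary key, while a list raises a TypeError. A quick illustration (the grid_values dictionary here is purely hypothetical):
# Tuple of hashable values: a valid dictionary key
grid_values = {(0, 0): 1.5, (0, 1): 2.0}
print(grid_values[(0, 1)])  # 2.0
# A list is mutable and therefore unhashable
try:
    bad_key = {[0, 0]: 1.5}
except TypeError as e:
    print(f"Error: {e}")  # unhashable type: 'list'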
Here's a basic example of common dictionary operations:
# Create a dictionary
feature_counts = {'word_a': 150, 'word_b': 85, 'word_c': 210}
# Access a value by key (average O(1))
print(f"Count of word_b: {feature_counts['word_b']}")
# Add a new key-value pair (average O(1))
feature_counts['word_d'] = 55
# Check if a key exists (average O(1))
print(f"Does 'word_a' exist? {'word_a' in feature_counts}")
print(f"Does 'word_x' exist? {'word_x' in feature_counts}")
# Update a value (average O(1))
feature_counts['word_a'] += 10
print(f"Updated count of word_a: {feature_counts['word_a']}")
# Get the number of items
print(f"Number of features: {len(feature_counts)}")
Dictionaries in Machine Learning:
A frequent preprocessing task is mapping categorical feature values to integer indices, since most models expect numeric input. A dictionary comprehension builds this mapping in one pass:
categories = ['red', 'green', 'blue', 'green', 'red']
cat_to_index = {category: i for i, category in enumerate(set(categories))}
print(f"Category to Index mapping: {cat_to_index}")
# Possible output: Category to Index mapping: {'green': 0, 'blue': 1, 'red': 2}
# (The exact assignment varies between runs because set iteration order
# depends on hash randomization; use sorted(set(categories)) for a
# reproducible mapping.)
indexed_features = [cat_to_index[cat] for cat in categories]
print(f"Indexed features: {indexed_features}")
# Possible output: Indexed features: [2, 0, 1, 0, 2]
Counting occurrences, such as word frequencies in a text corpus, is another task where dictionaries shine. collections.Counter is a specialized dictionary subclass built for exactly this:
from collections import Counter
document = "the quick brown fox jumps over the lazy dog the quick brown fox"
words = document.split()
word_counts = Counter(words)
print(f"Word counts: {word_counts}")
print(f"Frequency of 'the': {word_counts['the']}")
# Output: Word counts: Counter({'the': 3, 'quick': 2, 'brown': 2, 'fox': 2, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1})
# Output: Frequency of 'the': 3
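Counter also provides most_common(), which is convenient for selecting the top-k terms when building a vocabulary:
from collections import Counter
words = "the quick brown fox jumps over the lazy dog the quick brown fox".split()
word_counts = Counter(words)
# Top 3 (word, count) pairs, ordered by count (ties keep first-seen order)
print(word_counts.most_common(3))
# Output: [('the', 3), ('quick', 2), ('brown', 2)]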
Sets (set)
Sets are unordered collections of unique, hashable elements. Like dictionaries, they use hash tables internally, providing average O(1) time complexity for adding elements and, importantly, for checking membership (testing if an element is present in the set).
# Create a set
unique_tags = {'ml', 'python', 'data_science', 'algorithms'}
# Add an element (average O(1))
unique_tags.add('hashing')
# Check membership (average O(1))
print(f"Is 'python' in the set? {'python' in unique_tags}")
print(f"Is 'java' in the set? {'java' in unique_tags}")
# Remove an element (average O(1))
unique_tags.remove('algorithms') # Raises KeyError if not found
# unique_tags.discard('algorithms') # Doesn't raise error if not found
print(f"Current tags: {unique_tags}")
Sets in Machine Learning:
A common use is deduplication, such as extracting the unique feature names from a list:
feature_list = ['temperature', 'humidity', 'pressure', 'temperature', 'wind_speed']
unique_features = set(feature_list)
print(f"Unique features: {unique_features}")
# Output: Unique features: {'pressure', 'humidity', 'wind_speed', 'temperature'}
Sets also support mathematical set operations like union (|), intersection (&), difference (-), and symmetric difference (^). This is useful for comparing feature sets, finding common elements between groups, etc.
set_A = {'apple', 'banana', 'cherry'}
set_B = {'banana', 'cherry', 'date'}
print(f"Intersection: {set_A & set_B}") # Common elements
print(f"Union: {set_A | set_B}") # All unique elements
print(f"Difference (A - B): {set_A - set_B}") # Elements in A but not B
# Output: Intersection: {'cherry', 'banana'}
# Output: Union: {'date', 'cherry', 'apple', 'banana'}
# Output: Difference (A - B): {'apple'}
# (The element order inside each printed set may vary between runs.)
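As one concrete use, the Jaccard similarity of two feature sets, the quantity that MinHash-based Locality-Sensitive Hashing approximates, follows directly from these operators. A minimal sketch (jaccard is a hypothetical helper, not a library function):
def jaccard(a, b):
    # Jaccard similarity: |A & B| / |A | B|; defined here as 0.0 for two empty sets
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

set_A = {'apple', 'banana', 'cherry'}
set_B = {'banana', 'cherry', 'date'}
print(f"Jaccard(A, B) = {jaccard(set_A, set_B):.2f}")
# Output: Jaccard(A, B) = 0.50  (2 shared elements out of 4 total)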
While Python handles the low-level details, visualizing the mapping helps understanding. A hash function converts a key into an integer (hash value), and the modulo operator (%) maps this hash to an index within the table's array.
Keys are processed by a hash function and modulo operation to determine their slot index in the hash table. Collisions (like 'apple' and 'orange' mapping to index 3) are handled internally by Python.
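A short sketch of this key-to-slot mapping (the table size of 8 is assumed for illustration; real dictionaries resize dynamically, and the actual slots vary between runs because Python randomizes string hashes):
table_size = 8  # assumed for illustration
for key in ['apple', 'orange', 'banana']:
    slot = hash(key) % table_size  # % with a positive divisor is non-negative
    print(f"{key!r} -> slot {slot}")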
While dictionaries explicitly store the mapping from a feature (key) to its index or value, feature hashing (covered in detail in the section "Feature Hashing for Dimensionality Reduction") uses the hash function directly to determine the index in the output feature vector without storing the original feature key.
Imagine mapping words to indices in a fixed-size vector of length D. There are two approaches:
- Dictionary lookup: index = feature_map['word']. Requires building and storing feature_map.
- Feature hashing: index = hash('word') % D. No storage of the mapping is needed; collisions are implicitly handled by multiple features potentially mapping to the same index.
Python's built-in hash() function can be used to implement a simple version of this, although library implementations (like Scikit-learn's HashingVectorizer) are optimized and handle sign hashing and other details.
# Simplified feature hashing concept
feature_vector_size = 10
word1 = "machine"
word2 = "learning"
word3 = "algorithms"  # May collide with another word's index
# In Python, % with a positive divisor always returns a non-negative
# result, even when hash() is negative, so no abs() is needed.
index1 = hash(word1) % feature_vector_size
index2 = hash(word2) % feature_vector_size
index3 = hash(word3) % feature_vector_size
print(f"'{word1}' maps to index: {index1}")
print(f"'{word2}' maps to index: {index2}")
print(f"'{word3}' maps to index: {index3}")
# Note: string hashes are randomized per process (PYTHONHASHSEED), so
# these indices change between runs. Real implementations use a stable
# hash function so that indices are reproducible.
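For indices that are stable across runs, a deterministic hash can replace hash(). The sketch below (hashed_bow is a hypothetical helper; md5 is chosen only for its run-to-run stability, not for speed) builds a fixed-size term-count vector, the core idea behind HashingVectorizer:
import hashlib

def hashed_bow(tokens, dim=16):
    # Fixed-size count vector; colliding tokens share a slot
    vec = [0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % dim] += 1
    return vec

print(hashed_bow("the quick brown fox the".split()))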
While the average time complexity for dict and set operations is O(1), it's important to remember that the worst case degrades to O(n) when many keys collide in the same slot, that every operation first pays the cost of computing the key's hash, and that hash tables trade extra memory for their speed.
Python's dict and set are powerful tools backed by efficient hash table implementations. Leveraging them appropriately allows for fast data manipulation and lookup, which are often significant steps in machine learning pipelines, from data preprocessing and feature engineering to implementing aspects of certain algorithms. They also provide a practical foundation for understanding advanced hashing techniques like feature hashing and Locality-Sensitive Hashing.