Working with machine learning models often involves dealing with features that are high-dimensional and sparse. Think about text data represented as a bag-of-words, where the vocabulary size can be huge, or categorical features with millions of unique values (like user IDs or product codes). Traditional methods like one-hot encoding create vectors with mostly zeros, leading to massive memory consumption and potentially slower computations.
Feature hashing, sometimes called the "hashing trick," provides an efficient way to map these high-dimensional feature representations into a lower-dimensional space of fixed size, without needing to pre-build a vocabulary or mapping dictionary. This makes it particularly suitable for streaming data or scenarios where the full set of features isn't known beforehand.
The core idea is simple: instead of assigning a unique index to each distinct feature (like one-hot encoding does), we apply a hash function to the feature identifier (e.g., the word itself, or a "category=value" string) and use the resulting hash value, modulo the desired output vector size m, as the index in the new feature vector.
Let's say we want to map our features into a vector of size m. For a given feature f (e.g., the word "apple" or the feature "country=USA"), the process is:

1. Compute the hash of the feature identifier, h(f).
2. Take the index i = h(f) mod m.
3. Add the feature's value (often just 1 for presence) to position i of the output vector.
Features are mapped to indices in a fixed-size vector using a hash function and the modulo operator. Note that `feature_A` and `feature_D` collide, mapping to the same index 1.
The size m of the output vector is a hyperparameter you choose. It determines the degree of dimensionality reduction and the probability of collisions. A smaller m saves more memory but increases collisions; a larger m reduces collisions but uses more memory.
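To make the mechanics concrete, here is a minimal sketch of the unsigned hashing trick. The helper name `hash_features` is made up, and MD5 via `hashlib` is used only because it produces stable indices across Python sessions, unlike the built-in `hash()`:

```python
import hashlib

def hash_features(features, m):
    """Map an iterable of string features into a vector of size m
    using the basic (unsigned) hashing trick."""
    vec = [0.0] * m
    for f in features:
        # Hash the feature identifier with a stable hash function.
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        i = h % m          # index into the fixed-size vector
        vec[i] += 1.0      # add the feature's value (1 for presence)
    return vec

# A bag of words and a categorical feature can share the same vector.
print(hash_features(["the", "quick", "brown", "fox", "country=USA"], m=8))
```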
A consequence of mapping a potentially huge feature space into a fixed-size vector is that different features might hash to the same index. This is a hash collision. In feature hashing, collisions mean that multiple original features contribute to the same component of the resulting vector.
While this sounds problematic, in many machine learning applications, especially with sparse data and linear models (like Logistic Regression or SVMs), the negative impact of collisions can be surprisingly small. The effects might average out, or the model might learn to weigh the combined features appropriately.
To further mitigate the potential negative effects of collisions where features simply sum up at the same index, a common refinement uses a second hash function, g(f), to determine the sign of the value being added to the target index.
This way, if two features collide at index i, their contributions might add (+1, +1) or cancel (+1, -1). On average, collisions then tend to cancel out rather than systematically inflating the value at that index.
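A minimal sketch of this signed variant derives both the index and the sign from the same stable digest (again, the helper name and the choice of MD5 are illustrative, not a library API):

```python
import hashlib

def signed_hash_features(features, m):
    """Hashing trick with a second hash g(f) choosing the sign, so that
    colliding features tend to cancel instead of always accumulating."""
    vec = [0.0] * m
    for f in features:
        digest = hashlib.md5(f.encode("utf-8")).digest()
        i = int.from_bytes(digest[:8], "big") % m     # h(f) mod m -> index
        sign = 1.0 if digest[-1] % 2 == 0 else -1.0   # g(f) -> +1 or -1
        vec[i] += sign                                # signed update
    return vec

print(signed_hash_features(["feature_A", "feature_B", "feature_C", "feature_D"], m=4))
```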
Imagine processing the text "the quick brown fox" and hashing it into a vector of size m=5. Assume we have hash functions h and g:
| Feature (Word) | h(f) | i = h(f) mod 5 | g(f) (Sign) | Vector Update | Resulting Vector |
|---|---|---|---|---|---|
| "the" | 123 | 3 | +1 | vec[3] += 1 | [0, 0, 0, 1, 0] |
| "quick" | 456 | 1 | -1 | vec[1] -= 1 | [0, -1, 0, 1, 0] |
| "brown" | 789 | 4 | +1 | vec[4] += 1 | [0, -1, 0, 1, 1] |
| "fox" | 234 | 4 | -1 | vec[4] -= 1 | [0, -1, 0, 1, 0] (collision) |
In this example, "brown" and "fox" collide at index 4. Using the signed hash, their contributions cancel each other out (+1 and -1). Without the signed hash, index 4 would have ended up with a value of 2.
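The table above can be reproduced in a few lines of Python by hard-coding the assumed hash values h(f) and signs g(f) from the example; a real implementation would compute them with actual hash functions:

```python
# Toy reproduction of the worked example, using the assumed hash values
# h(f) and signs g(f) from the table (not a real hash function).
m = 5
h = {"the": 123, "quick": 456, "brown": 789, "fox": 234}
g = {"the": +1, "quick": -1, "brown": +1, "fox": -1}

vec = [0] * m
for word in "the quick brown fox".split():
    i = h[word] % m      # index from the (assumed) hash
    vec[i] += g[word]    # signed update
    print(word, "->", i, vec)

# Final vector: [0, -1, 0, 1, 0]; "brown" and "fox" cancel at index 4.
```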
While you could implement feature hashing manually using Python's built-in `hash()` (being mindful of its potential variability across sessions and platforms for non-numeric types) and the modulo operator, libraries like Scikit-learn provide robust implementations. Scikit-learn's `sklearn.feature_extraction.FeatureHasher` is a convenient tool:
```python
from sklearn.feature_extraction import FeatureHasher

# Initialize FeatureHasher with the desired output dimension (n_features).
# input_type='dict' because each sample is a dict of {feature_name: value}.
# alternate_sign=True enables the signed hash trick described above.
h = FeatureHasher(n_features=10, input_type='dict', alternate_sign=True)

# Example data: list of dictionaries (one per sample)
# Keys are feature names, values are feature values (often 1 for presence)
raw_data = [
{'cat': 1, 'dog': 1, 'mouse': 1},
{'cat': 1, 'bird': 1},
{'dog': 1, 'fish': 1, 'cat': 1}
]
# Transform the data
hashed_features = h.transform(raw_data)
# Output is a SciPy sparse matrix
print(hashed_features.toarray())
# Example Output (values depend on the hash function implementation):
# [[ 0. 1. 0. 0. 0. -1. 0. 0. 1. 0.]
# [ 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]
# [ 0. 1. 0. 0. 0. -1. -1. 0. 1. 0.]]
print(f"Original number of potential features: Infinite (or very large)")
print(f"Hashed features shape: {hashed_features.shape}")
# Hashed features shape: (3, 10)
Using Scikit-learn's `FeatureHasher` to transform categorical features into a fixed-size sparse matrix representation.

Notice that `FeatureHasher` directly outputs a SciPy sparse matrix, which is memory-efficient, storing only the non-zero elements.
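To see this concretely, here is a small sketch (the feature names are arbitrary) showing that even with a very large n_features, only the non-zero entries are stored:

```python
from sklearn.feature_extraction import FeatureHasher

# A large output dimension keeps collisions rare...
h = FeatureHasher(n_features=2**20, input_type='dict')
X = h.transform([{'cat': 1, 'dog': 1}])

# ...yet the sparse matrix stores only the handful of non-zero entries,
# not all 1,048,576 columns.
print(X.shape)  # (1, 1048576)
print(X.nnz)    # 2 (or fewer, if the two features happen to collide)
```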
Another trade-off is interpretability: each index k in the hashed vector represents an aggregation of potentially many original features, so you can't easily trace it back to a specific original feature like "country=USA".

Feature hashing is particularly effective in scenarios involving:

- Text data with very large or open-ended vocabularies (bag-of-words, n-grams).
- High-cardinality categorical features such as user IDs or product codes.
- Streaming or online learning, where the full set of features isn't known beforehand (see the sketch below).
- Memory-constrained settings where building and storing a full vocabulary is impractical.
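For example, because the hasher is stateless, batches containing previously unseen features (the IDs below are invented) can be transformed on the fly into the same fixed-size space, with no vocabulary to build or update:

```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=16, input_type='dict')

# Two batches with entirely different, previously unseen features
# still map into the same 16-dimensional space.
batch1 = h.transform([{'user_id=123': 1, 'country=USA': 1}])
batch2 = h.transform([{'user_id=987654': 1, 'product=ZX-42': 1}])

print(batch1.shape, batch2.shape)  # (1, 16) (1, 16)
```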
By applying the principles of hashing discussed earlier, feature hashing provides a powerful and practical technique for dimensionality reduction in machine learning, trading off perfect feature separation for significant gains in memory efficiency and computational scalability.