As we've seen with TF-IDF and N-grams, representing text numerically can lead to very high-dimensional feature vectors. Imagine a vocabulary of hundreds of thousands, or even millions, of unique words or N-grams. Storing and processing matrices with this many columns demands significant memory and computational resources. This is especially true when dealing with large datasets or streaming data where the full vocabulary isn't known in advance.
Feature hashing, sometimes called the "hashing trick," offers a clever workaround. Instead of building and storing an explicit vocabulary map (like word -> column index), feature hashing uses a hash function to directly map feature names (like words or N-grams) to indices in a fixed-size vector.
The Hashing Mechanism
At its core, feature hashing works like this:
- Choose a Vector Size: You predefine the desired number of dimensions (columns) for your output vector, let's call it D. This size is typically much smaller than the potential vocabulary size.
- Apply a Hash Function: For each feature (e.g., the word "predict" or the bigram "machine learning"), apply a hash function (like MurmurHash3) that outputs an integer.
- Determine the Index: Take the absolute value of the hash function's output and compute its remainder when divided by the chosen vector size D. This gives you an index between 0 and D−1.
index = hash("feature_name") mod D
- Update the Vector: The value associated with the feature (e.g., its term frequency or TF-IDF score) is then added to the value at that calculated index in the output vector (see the sketch just after this list).
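To make the mechanism concrete, here is a minimal Python sketch of steps 2 and 3, using the standard-library hashlib (MD5 stands in for MurmurHash3, which isn't in the standard library); the function name feature_index is purely illustrative:

```python
import hashlib

def feature_index(feature_name: str, D: int) -> int:
    """Map a feature name to a column index in the range [0, D)."""
    # A stable hash of the feature name (MD5 here for portability;
    # MurmurHash3 is the common choice in real implementations).
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "little")  # interpret 64 bits as a non-negative integer
    return h % D                              # remainder -> index between 0 and D-1

print(feature_index("predict", 10))           # some index in 0..9
print(feature_index("machine learning", 10))  # bigrams hash the same way
```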
Consider a simple example with a target vector size D=10:
hash("learning") % 10 = 3
hash("algorithm") % 10 = 7
hash("model") % 10 = 3
If our document contained each of these words once (TF = 1), the resulting vector (initially all zeros) would be:
[0, 0, 0, 2, 0, 0, 0, 1, 0, 0]
Notice that "learning" and "model" both hashed to index 3. This is called a hash collision.
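Here is a hedged sketch of how those counts accumulate, assuming the indices 3, 7, and 3 from the example above (real indices depend on the hash function you use):

```python
from collections import Counter

D = 10
# Indices assumed from the example above; actual values depend on the hash function.
assumed_index = {"learning": 3, "algorithm": 7, "model": 3}

vector = [0] * D
for word, tf in Counter(["learning", "algorithm", "model"]).items():
    vector[assumed_index[word]] += tf  # colliding features accumulate in the same slot

print(vector)  # [0, 0, 0, 2, 0, 0, 0, 1, 0, 0]
```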
Diagram illustrating how different features are mapped to indices in a fixed-size vector via a hash function. Note the collision where "learning" and "model" map to the same index (3).
Handling Collisions
Collisions are inherent to feature hashing. Multiple distinct features might map to the same index in the output vector. The counts or weights for these colliding features simply add up at that index.
- Trade-off: The smaller you make the target vector size D, the more memory you save, but the higher the probability of collisions.
- Impact: While collisions mean some information is lost (the model can't distinguish between features that hash to the same index), in practice, for sparse high-dimensional data like text, the impact on model performance is often surprisingly small, especially if D is reasonably large (e.g., 2^18 to 2^24). The inherent sparsity means many indices might only have one feature mapped to them anyway.
- Signed Hashing: A common enhancement is to use a second hash function to determine the sign (+1 or -1) of the value being added to the index. This way, colliding features have a chance to partially cancel each other out rather than always accumulating positively, which can sometimes mitigate the negative effects of collisions.
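A minimal sketch of signed hashing, assuming the sign is drawn from a separate byte of the same digest (a stand-in for a true second hash function):

```python
import hashlib

def signed_update(vector, feature_name, value=1.0):
    """Add `value` at the hashed index, with the sign decided by a second hash."""
    D = len(vector)
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "little") % D  # first hash -> which slot
    sign = 1 if digest[8] % 2 == 0 else -1            # second hash bit -> +1 or -1
    vector[index] += sign * value                     # collisions may partially cancel

vec = [0.0] * 10
for word in ["learning", "algorithm", "model"]:
    signed_update(vec, word)
print(vec)
```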
Advantages of Feature Hashing
- Memory Efficiency: It requires significantly less memory than storing a full vocabulary mapping, especially for massive feature sets. The memory requirement is fixed by the chosen vector dimension D.
- Online Learning: It's well-suited for scenarios where data arrives sequentially (streaming). You don't need to see the entire dataset upfront to build a vocabulary; new features encountered later are simply hashed into the existing vector space.
- Simplicity: It avoids the complexity of managing a vocabulary dictionary.
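In practice you rarely implement the trick by hand; scikit-learn's HashingVectorizer, for example, applies signed feature hashing with a fixed number of output columns and no stored vocabulary. A small usage sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the model is learning", "a learning algorithm updates the model"]

# Fixed output width of 2**20 columns; signed hashing is on by default.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=True)

# No fit step and no stored vocabulary, so new batches can be transformed as they arrive.
X = vectorizer.transform(docs)
print(X.shape)  # (2, 1048576), held as a sparse matrix
```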
Disadvantages of Feature Hashing
- Interpretability: The resulting feature vectors are difficult to interpret. An activation at index i could be due to any of the features that hash to i. You lose the direct mapping from vector index back to a specific word or N-gram.
- Collision Effects: If the hash space (D) is too small relative to the number of unique features, excessive collisions can degrade model performance by mapping too many unrelated features together. Choosing an appropriate D often requires experimentation.
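One way to guide that experimentation is to hash a sample of your observed features and measure how often they collide for candidate values of D; a rough sketch (the synthetic vocabulary is a stand-in for your real feature set):

```python
import hashlib
from collections import Counter

def collision_rate(features, D):
    """Fraction of features that share an index with at least one other feature."""
    slots = Counter(
        int.from_bytes(hashlib.md5(f.encode("utf-8")).digest()[:8], "little") % D
        for f in features
    )
    collided = sum(n for n in slots.values() if n > 1)
    return collided / len(features)

vocab = [f"feature_{i}" for i in range(100_000)]  # stand-in for a real feature set
for D in (2**16, 2**18, 2**20):
    print(D, round(collision_rate(vocab, D), 3))
```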
Feature hashing provides a powerful, memory-efficient alternative to traditional vocabulary-based vectorization such as Bag-of-Words or TF-IDF, and is particularly valuable in large-scale or resource-constrained environments. It achieves dimensionality reduction implicitly during vectorization, unlike techniques such as PCA or SVD, which are applied only after a potentially huge initial feature matrix has been built.