When dealing with categorical features, especially those with a very large number of unique values (high cardinality), methods like One-Hot Encoding can lead to an explosion in the number of features, potentially making models computationally expensive and prone to overfitting. Similarly, Target Encoding requires careful handling to prevent data leakage. The Hashing Encoder, often referred to as the "hashing trick," offers an alternative approach that effectively handles high cardinality features and is suitable for online learning scenarios while keeping the output dimensionality fixed.
The core idea behind hashing encoding is to use a hash function to map potentially numerous category values into a predefined, fixed number of dimensions (output features). Instead of assigning a unique column to each category (like One-Hot Encoding) or calculating statistics based on the target variable (like Target Encoding), hashing applies a hash function to the category name (usually represented as a string) and then uses the resulting hash value to determine which output column(s) will represent that category.
The hash value is then reduced to a column index, typically with the modulo operation (`%`). If you want `k` output features, you calculate `hash_value % k`. This operation maps the hash value to an index between 0 and k−1. The value placed in the selected column (index `hash_value % k`) is typically set to 1. Other implementations might use different schemes, such as using +1 or -1 based on another bit of the hash to potentially mitigate collision effects slightly, or even using feature counts.

Consider a categorical feature 'City' with values like 'London', 'Paris', 'Tokyo', 'New York'. If we choose `n_components=3` for our Hashing Encoder, the (illustrative) hash values and resulting indices might be:
- 'London' -> 1234567890 -> 1234567890 % 3 -> index 0
- 'Paris' -> 9876543210 -> 9876543210 % 3 -> index 0
- 'Tokyo' -> 5555555554 -> 5555555554 % 3 -> index 1
- 'New York' -> 1122334454 -> 1122334454 % 3 -> index 2
The resulting encoded features might look like this:

| Original | Hash Feature 0 | Hash Feature 1 | Hash Feature 2 |
|---|---|---|---|
| London | 1 | 0 | 0 |
| Paris | 1 | 0 | 0 |
| Tokyo | 0 | 1 | 0 |
| New York | 0 | 0 | 1 |
Notice a significant issue here: 'London' and 'Paris' have mapped to the same output feature (index 0). This is called a hash collision.
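To make the mechanics concrete, here is a minimal sketch of the hashing trick in plain Python. It assumes `hashlib.md5` as the hash function purely for illustration; real encoders may use a different hash (such as MurmurHash), so the indices you see will differ from the illustrative values above.

```python
import hashlib

def hash_index(category: str, n_components: int) -> int:
    """Map a category string to a fixed output index via hash-then-modulo."""
    # Hash the string to a large integer (md5 here is purely illustrative).
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(category: str, n_components: int) -> list[int]:
    """Produce the one-hot-style hashed feature vector for a single value."""
    vector = [0] * n_components
    vector[hash_index(category, n_components)] = 1
    return vector

for city in ["London", "Paris", "Tokyo", "New York"]:
    print(city, "->", hash_encode(city, 3))
```

Run this with different `n_components` values and you will see categories sharing an index exactly as in the collision example above.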
Hashing encoding has several properties worth weighing:

- Fixed dimensionality: the output always has a predetermined number of features (`n_components`), regardless of the number of unique categories. This prevents the dimensionality explosion seen with One-Hot Encoding on high-cardinality features.
- Handles unseen categories: a category not seen during fitting is simply hashed into one of the existing output columns (`n_components`), which makes the encoder well suited to online settings.
- Loss of interpretability: an output feature such as `hash_0` doesn't correspond to 'London' or 'Paris' specifically, but rather to the set of categories that happen to hash to that index. This makes interpreting model coefficients or feature importances associated with hashed features difficult.
- Choosing `n_components`: selecting the number of output dimensions is a critical hyperparameter. Too few dimensions increase the risk of collisions, potentially harming performance. Too many dimensions might negate some of the dimensionality reduction benefits, although the result will still be significantly smaller than One-Hot Encoding for very high cardinality features. This often requires tuning based on model validation performance; the sketch after this list gives a feel for the collision trade-off.
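As a rough way to see that trade-off, you can count how many of a vocabulary's values end up sharing a bucket at different settings. This is a minimal sketch, reusing the same illustrative `hashlib`-based index function as above; the category names are hypothetical:

```python
import hashlib

def hash_index(category: str, n_components: int) -> int:
    # Same illustrative hash-then-modulo mapping as before.
    return int(hashlib.md5(category.encode("utf-8")).hexdigest(), 16) % n_components

# A hypothetical high-cardinality feature with 10,000 distinct values.
categories = [f"user_{i}" for i in range(10_000)]

for k in [64, 256, 1024, 4096]:
    occupied = {hash_index(c, k) for c in categories}
    # Every category beyond the first in a bucket is a collision.
    collisions = len(categories) - len(occupied)
    print(f"n_components={k}: {collisions} colliding categories")
```

Larger `n_components` values reduce collisions but increase the feature count, so the right setting depends on your model and data.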
While Scikit-learn has a `HashingVectorizer` often used for text, the `category_encoders` library provides a convenient `HashingEncoder` specifically for categorical features in DataFrames.
```python
import pandas as pd
import category_encoders as ce

# Sample Data
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Yellow', 'Blue', 'Black'],
        'Value': [10, 20, 15, 12, 25, 18, 5]}
df = pd.DataFrame(data)

# Define the number of output dimensions (features)
n_output_features = 4

# Initialize the Hashing Encoder
# We specify the column to encode and the desired number of components
encoder = ce.HashingEncoder(cols=['Color'], n_components=n_output_features)

# Fit and transform the data
df_encoded = encoder.fit_transform(df)

print("Original DataFrame:")
print(df)
print("\nDataFrame after Hashing Encoding (n_components=4):")
print(df_encoded)

# Example showing how new categories are handled
new_data = pd.DataFrame({'Color': ['Red', 'Orange', 'Blue'], 'Value': [11, 30, 22]})
df_new_encoded = encoder.transform(new_data)  # Use transform, not fit_transform

print("\nEncoding new data (including 'Orange'):")
print(df_new_encoded)
```
The output `df_encoded` will replace the 'Color' column with `n_output_features` (in this case, 4) new columns named `col_0`, `col_1`, `col_2`, `col_3`. Each row will have values (often 1s, but implementation details vary) distributed across these columns based on the hash of the original 'Color' value. Notice how the unseen category 'Orange' in `new_data` is handled without error during the `transform` step.
Hashing encoding is particularly useful when:

- The feature has very high cardinality, where One-Hot Encoding would produce an unmanageable number of columns.
- You are in an online learning setting where new, unseen categories arrive over time.
- You need the output dimensionality to stay fixed and known in advance.

It's often a pragmatic choice when simpler methods struggle, but be mindful of the potential impact of hash collisions. Experimenting with different values for `n_components` and evaluating the effect on downstream model performance is usually necessary, as sketched below.
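A minimal sketch of such an experiment, assuming scikit-learn is available; the data, model, and candidate values here are hypothetical, purely to show the tuning pattern:

```python
import pandas as pd
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical data: one categorical column and a binary target
X = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red",
                            "Yellow", "Blue", "Black", "Green"] * 25})
y = [0, 1, 0, 0, 1, 1, 0, 1] * 25

# Compare cross-validated performance across candidate n_components values
for k in [2, 4, 8, 16]:
    pipe = Pipeline([
        ("encode", ce.HashingEncoder(cols=["Color"], n_components=k)),
        ("model", LogisticRegression()),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"n_components={k}: mean accuracy={scores.mean():.3f}")
```

Placing the encoder inside the pipeline ensures it is fit only on each training fold, keeping the evaluation honest.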