While One-Hot Encoding is effective for nominal features with few categories, it can lead to a high-dimensional feature space when dealing with variables that have many unique values (high cardinality). Ordinal encoding requires a meaningful order, which isn't always present. Target encoding, also known as mean encoding, offers an alternative approach that directly utilizes information from the target variable to create a numerical representation for categories.
The fundamental principle behind target encoding is to replace each category within a feature with the average value of the target variable associated with that category.
Consider a simple binary classification dataset with a categorical feature `City` and a binary target `Purchased`:
City | Purchased |
---|---|
London | 1 |
Paris | 0 |
London | 0 |
Tokyo | 1 |
Paris | 1 |
London | 1 |
Tokyo | 0 |
To calculate the target encoding for `City`:

1. Group the data by the `City` column.
2. Calculate the mean of the `Purchased` column for each group: London = (1 + 0 + 1) / 3 ≈ 0.67, Paris = (0 + 1) / 2 = 0.50, Tokyo = (1 + 0) / 2 = 0.50.
3. Replace each category with its group mean.
The encoded feature would look like this:
City (Encoded) | Purchased |
---|---|
0.67 | 1 |
0.50 | 0 |
0.67 | 0 |
0.50 | 1 |
0.50 | 1 |
0.67 | 1 |
0.50 | 0 |
This single numerical feature now incorporates information about the likelihood of purchase associated with each city, potentially providing significant predictive power without drastically increasing dimensionality.
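As a minimal sketch of this naive calculation (using a small DataFrame built from the table above, with column names mirroring the example), the per-category means can be computed with a pandas `groupby`:

```python
import pandas as pd

# Toy dataset matching the table above
df = pd.DataFrame({
    'City': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London', 'Tokyo'],
    'Purchased': [1, 0, 0, 1, 1, 1, 0],
})

# Naive target encoding: replace each city with the mean of 'Purchased'
# over ALL rows of that city, including the current row
df['City_encoded'] = df.groupby('City')['Purchased'].transform('mean')
print(df)
```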
Target encoding seems powerful, but performing it naively, as shown above, introduces a significant risk: target leakage.
When you calculate the encoding for a specific row using the target value of that same row, you are leaking information from the target into the feature. The model learns a direct correlation: "When the target for this specific row was 1, the encoded value was slightly higher/lower." This leads to overly optimistic performance during training because the model effectively gets a hint about the target value. However, this performance boost won't generalize to new, unseen data where the target is unknown, resulting in overfitting.
Imagine encoding the `City` for the first row (`London`, `Purchased=1`). The mean calculated (0.67) included that very `Purchased=1` value.
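To see this concretely, here is a tiny sketch (reusing the toy `df` from the snippet above) comparing London's mean with and without the first row's own target value:

```python
# All 'Purchased' values for London in the toy data: [1, 0, 1]
london = df.loc[df['City'] == 'London', 'Purchased']

print(london.mean())           # 0.67 -- includes the first row's own target
print(london.iloc[1:].mean())  # 0.50 -- the mean that row would get if its own target were excluded
```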
To use target encoding safely and effectively, you must implement strategies to prevent or minimize target leakage.
Categories with very few samples can lead to unreliable mean estimates. For instance, if a city appeared only once and resulted in a purchase, its target encoding would be 1.0, which might be an extreme and noisy estimate. Smoothing addresses this by blending the category's mean with the overall global mean of the target variable.
A common smoothing technique uses the formula:
$$\text{Smoothed Mean} = \frac{\text{count} \times \text{category\_mean} + m \times \text{global\_mean}}{\text{count} + m}$$

Where:

- `count` is the number of samples belonging to the category.
- `category_mean` is the original mean target value for that category.
- `global_mean` is the mean target value across the entire dataset.
- `m` is the smoothing factor, a hyperparameter determining the "strength" of the smoothing. A higher `m` means smaller categories will have their means pulled more strongly towards the global mean.

The intuition is that for categories with a large `count`, the formula heavily weights the `category_mean`. For categories with a small `count`, the `global_mean` dominates, providing a more conservative estimate. Determining the optimal `m` often involves cross-validation.
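As an illustration, here is a small sketch of the formula in code (the `smoothed_mean` helper and the choice `m = 2` are purely for demonstration), using the global mean of the seven-row toy table above. A category seen only once is pulled noticeably towards the global mean, while a larger category barely moves:

```python
# Illustrative helper implementing the smoothing formula above
def smoothed_mean(category_mean, count, global_mean, m):
    """Blend a category's mean with the global mean, weighted by its sample count."""
    return (count * category_mean + m * global_mean) / (count + m)

global_mean = 4 / 7   # mean of 'Purchased' across the 7-row toy table (~0.57)
m = 2                 # smoothing strength, chosen arbitrarily for illustration

# London: 3 samples with raw mean ~0.67 -> barely moves (~0.63)
print(smoothed_mean(2 / 3, 3, global_mean, m))

# A hypothetical city seen once with Purchased = 1: raw mean 1.0 -> pulled to ~0.71
print(smoothed_mean(1.0, 1, global_mean, m))
```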
A more robust approach involves calculating encodings within a cross-validation framework. Instead of calculating means on the entire dataset, you compute them only on the training folds and apply them to the corresponding validation fold.
Here's a typical workflow using K-Fold cross-validation:
1. Split the training data into K folds.
2. For each fold `k` (acting as a temporary validation set):
   - Calculate the (smoothed) category means using only the data in the remaining K-1 folds.
   - Use those means to encode the categorical feature for the rows in fold `k`.

This method ensures that the encoding for any given data point is calculated without using its own target value, effectively preventing direct leakage within the training set evaluation.
```python
import pandas as pd
from sklearn.model_selection import KFold

# Sample Data
data = {'City': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London', 'Tokyo', 'Rome', 'Rome'],
        'Purchased': [1, 0, 0, 1, 1, 1, 0, 0, 1]}
df = pd.DataFrame(data)

target = 'Purchased'
feature = 'City'

# Global mean for smoothing and potentially filling NAs later
global_mean = df[target].mean()

# Smoothing factor
m = 1

# Setup KFold
kf = KFold(n_splits=3, shuffle=True, random_state=42)
df[f'{feature}_encoded'] = 0.0  # Initialize encoded column

# Apply CV encoding
for train_index, val_index in kf.split(df):
    df_train, df_val = df.iloc[train_index], df.iloc[val_index]

    # Calculate smoothed means on the training part of the fold
    means = df_train.groupby(feature)[target].agg(['count', 'mean'])
    smoothed_means = (means['count'] * means['mean'] + m * global_mean) / (means['count'] + m)

    # Apply means to the validation part of the fold
    # Use .map() and fill potential new categories in validation with global mean
    df.loc[val_index, f'{feature}_encoded'] = df_val[feature].map(smoothed_means).fillna(global_mean)

print("Original DataFrame with CV Target Encoding:")
print(df)

# Example: Encoding a new data point (use means from the full training set)
full_train_means = df.groupby(feature)[target].agg(['count', 'mean'])
full_smoothed_means = (full_train_means['count'] * full_train_means['mean'] + m * global_mean) / (full_train_means['count'] + m)

new_data = pd.DataFrame({'City': ['Paris', 'Berlin']})  # Berlin is unseen
new_data[f'{feature}_encoded'] = new_data[feature].map(full_smoothed_means).fillna(global_mean)

print("\nEncoding New Data:")
print(new_data)
```
The code demonstrates K-Fold target encoding with smoothing. Note how the encoded value for each row is derived from means calculated on other folds, and how unseen categories (`Berlin`) are handled using the global mean.
While you can implement target encoding manually using Pandas as shown, specialized libraries often provide robust and optimized implementations that handle smoothing and cross-validation internally. The `category_encoders` library is a popular choice:
```python
# Example using category_encoders (requires installation: pip install category_encoders)
# import category_encoders as ce

# Assuming df_train, df_test are your training and test sets
# target_encoder = ce.TargetEncoder(cols=[feature], smoothing=1.0)  # Specify smoothing

# Fit on training data (calculates means)
# target_encoder.fit(df_train[feature], df_train[target])

# Transform training and test data
# df_train[f'{feature}_encoded'] = target_encoder.transform(df_train[feature])
# df_test[f'{feature}_encoded'] = target_encoder.transform(df_test[feature])

# Note: category_encoders' TargetEncoder performs smoothing but might not implement
# the strict K-Fold validation encoding by default; check documentation for CV strategies.
# For rigorous leakage prevention, manual K-Fold or integrating it with sklearn pipelines is often preferred.
```
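One common way to combine the library with leakage-aware evaluation is to place the encoder inside a scikit-learn `Pipeline`, so that during cross-validation it is fit only on each training fold and the validation fold's targets are never used to construct its own encoded values. This is a rough sketch assuming `category_encoders` is installed, reusing `df`, `feature`, and `target` from the K-Fold example above:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# The encoder is a pipeline step, so cross-validation fits it only on the
# training portion of each fold before scoring on the held-out fold.
pipeline = Pipeline([
    ('encode', ce.TargetEncoder(cols=[feature], smoothing=1.0)),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipeline, df[[feature]], df[target], cv=3)
print(scores)
```

Within each training fold the encoder still sees those rows' own targets when it fits, so for the strictest leakage control the manual K-Fold encoding shown earlier remains a useful complement.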
Advantages:

- Handles high-cardinality categorical features without expanding the feature space the way One-Hot Encoding does.
- Produces a single numerical feature that directly captures the relationship between each category and the target, which can carry significant predictive power.
Disadvantages:

- Highly prone to target leakage and overfitting if computed naively on the full dataset.
- Gives noisy, unreliable estimates for rare categories unless smoothing or a similar adjustment is applied.
- Adds workflow complexity: smoothing hyperparameters, cross-validation encoding, and handling of unseen categories.
Target encoding is a powerful technique, particularly useful for:

- High-cardinality nominal features, where One-Hot Encoding would create an impractically wide feature space.
- Problems where the average target value per category is itself a meaningful predictive signal.
Always apply target encoding after splitting your data into training and testing sets, and use techniques like cross-validation encoding and smoothing within your training pipeline to build reliable models.