While One-Hot Encoding is effective for nominal features with few categories, it can lead to a high-dimensional feature space when dealing with variables that have many unique values (high cardinality). Ordinal encoding requires a meaningful order, which isn't always present. Target encoding, also known as mean encoding, offers an alternative approach that directly utilizes information from the target variable to create a numerical representation for categories.
The fundamental principle behind target encoding is to replace each category within a feature with the average value of the target variable associated with that category.
Consider a simple binary classification dataset with a categorical feature `City` and a binary target `Purchased`:
City | Purchased |
---|---|
London | 1 |
Paris | 0 |
London | 0 |
Tokyo | 1 |
Paris | 1 |
London | 1 |
Tokyo | 0 |
To calculate the target encoding for `City`:

1. Group the data by the `City` column.
2. Calculate the mean of the `Purchased` column for each group: London = (1 + 0 + 1) / 3 ≈ 0.67, Paris = (0 + 1) / 2 = 0.50, Tokyo = (1 + 0) / 2 = 0.50.
3. Replace each category with its group mean.
The encoded feature would look like this:
City (Encoded) | Purchased |
---|---|
0.67 | 1 |
0.50 | 0 |
0.67 | 0 |
0.50 | 1 |
0.50 | 1 |
0.67 | 1 |
0.50 | 0 |
This single numerical feature now incorporates information about the likelihood of purchase associated with each city, potentially providing significant predictive power without drastically increasing dimensionality.
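As a minimal sketch of this naive calculation (using a small DataFrame built from the table above, with column names mirroring the example), the per-category means can be computed with a pandas `groupby`:

```python
import pandas as pd

# Toy dataset matching the table above
df = pd.DataFrame({
    'City': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London', 'Tokyo'],
    'Purchased': [1, 0, 0, 1, 1, 1, 0],
})

# Naive target encoding: replace each city with the mean of 'Purchased'
# over ALL rows of that city, including the current row
df['City_encoded'] = df.groupby('City')['Purchased'].transform('mean')
print(df)
```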
Target encoding seems powerful, but performing it naively, as shown above, introduces a significant risk: target leakage.
When you calculate the encoding for a specific row using the target value of that same row, you are leaking information from the target into the feature. The model learns a direct correlation: "When the target for this specific row was 1, the encoded value was slightly higher/lower." This leads to overly optimistic performance during training because the model effectively gets a hint about the target value. However, this performance boost won't generalize to new, unseen data where the target is unknown, resulting in overfitting.
Imagine encoding the `City` for the first row (`London`, `Purchased=1`). The mean calculated (0.67) included that very `Purchased=1` value.
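To see this concretely, here is a tiny sketch (reusing the toy `df` from the snippet above) comparing London's mean with and without the first row's own target value:

```python
# All 'Purchased' values for London in the toy data: [1, 0, 1]
london = df.loc[df['City'] == 'London', 'Purchased']

print(london.mean())           # 0.67 -- includes the first row's own target
print(london.iloc[1:].mean())  # 0.50 -- the mean that row would get if its own target were excluded
```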
To use target encoding safely and effectively, you must implement strategies to prevent or minimize target leakage.
Categories with very few samples can lead to unreliable mean estimates. For instance, if a city appeared only once and resulted in a purchase, its target encoding would be 1.0, which might be an extreme and noisy estimate. Smoothing addresses this by blending the category's mean with the overall global mean of the target variable.
A common smoothing technique uses the formula:
$$\text{Smoothed Mean} = \frac{\text{count} \times \text{category\_mean} + m \times \text{global\_mean}}{\text{count} + m}$$

Where:

- `count` is the number of samples belonging to the category.
- `category_mean` is the original mean target value for that category.
- `global_mean` is the mean target value across the entire dataset.
- `m` is the smoothing factor, a hyperparameter determining the "strength" of the smoothing. A higher `m` means smaller categories will have their means pulled more strongly towards the global mean.

The intuition is that for categories with a large `count`, the formula heavily weights the `category_mean`. For categories with a small `count`, the `global_mean` dominates, providing a more conservative estimate. Determining the optimal `m` often involves cross-validation.
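As an illustration, here is a small sketch of the formula in code (the `smoothed_mean` helper and the choice `m = 2` are purely for demonstration), using the global mean of the seven-row toy table above. A category seen only once is pulled noticeably towards the global mean, while a larger category barely moves:

```python
# Illustrative helper implementing the smoothing formula above
def smoothed_mean(category_mean, count, global_mean, m):
    """Blend a category's mean with the global mean, weighted by its sample count."""
    return (count * category_mean + m * global_mean) / (count + m)

global_mean = 4 / 7   # mean of 'Purchased' across the 7-row toy table (~0.57)
m = 2                 # smoothing strength, chosen arbitrarily for illustration

# London: 3 samples with raw mean ~0.67 -> barely moves (~0.63)
print(smoothed_mean(2 / 3, 3, global_mean, m))

# A hypothetical city seen once with Purchased = 1: raw mean 1.0 -> pulled to ~0.71
print(smoothed_mean(1.0, 1, global_mean, m))
```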
A more robust approach involves calculating encodings within a cross-validation framework. Instead of calculating means on the entire dataset, you compute them only on the training folds and apply them to the corresponding validation fold.
Here's a typical workflow using K-Fold cross-validation:
1. Split the training data into K folds.
2. For each fold `k` (acting as a temporary validation set):
   - Calculate the (smoothed) category means using only the data in the remaining K-1 folds.
   - Use those means to encode the categorical feature for the rows in fold `k`.

This method ensures that the encoding for any given data point is calculated without using its own target value, effectively preventing direct leakage within the training set evaluation.
```python
import pandas as pd
from sklearn.model_selection import KFold

# Sample Data
data = {'City': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London', 'Tokyo', 'Rome', 'Rome'],
        'Purchased': [1, 0, 0, 1, 1, 1, 0, 0, 1]}
df = pd.DataFrame(data)

target = 'Purchased'
feature = 'City'

# Global mean for smoothing and potentially filling NAs later
global_mean = df[target].mean()

# Smoothing factor
m = 1

# Setup KFold
kf = KFold(n_splits=3, shuffle=True, random_state=42)
df[f'{feature}_encoded'] = 0.0  # Initialize encoded column

# Apply CV encoding
for train_index, val_index in kf.split(df):
    df_train, df_val = df.iloc[train_index], df.iloc[val_index]

    # Calculate smoothed means on the training part of the fold
    means = df_train.groupby(feature)[target].agg(['count', 'mean'])
    smoothed_means = (means['count'] * means['mean'] + m * global_mean) / (means['count'] + m)

    # Apply means to the validation part of the fold
    # Use .map() and fill potential new categories in validation with global mean
    df.loc[val_index, f'{feature}_encoded'] = df_val[feature].map(smoothed_means).fillna(global_mean)

print("Original DataFrame with CV Target Encoding:")
print(df)

# Example: Encoding a new data point (use means from the full training set)
full_train_means = df.groupby(feature)[target].agg(['count', 'mean'])
full_smoothed_means = (full_train_means['count'] * full_train_means['mean'] + m * global_mean) / (full_train_means['count'] + m)

new_data = pd.DataFrame({'City': ['Paris', 'Berlin']})  # Berlin is unseen
new_data[f'{feature}_encoded'] = new_data[feature].map(full_smoothed_means).fillna(global_mean)

print("\nEncoding New Data:")
print(new_data)
```
The code demonstrates K-Fold target encoding with smoothing. Note how the encoded value for each row is derived from means calculated on other folds, and how unseen categories (`Berlin`) are handled using the global mean.
While you can implement target encoding manually using Pandas as shown, specialized libraries often provide robust and optimized implementations that handle smoothing and cross-validation internally. The `category_encoders` library is a popular choice:
```python
# Example using category_encoders (requires installation: pip install category_encoders)
# import category_encoders as ce

# Assuming df_train, df_test are your training and test sets
# target_encoder = ce.TargetEncoder(cols=[feature], smoothing=1.0)  # Specify smoothing

# Fit on training data (calculates means)
# target_encoder.fit(df_train[feature], df_train[target])

# Transform training and test data
# df_train[f'{feature}_encoded'] = target_encoder.transform(df_train[feature])
# df_test[f'{feature}_encoded'] = target_encoder.transform(df_test[feature])

# Note: category_encoders' TargetEncoder performs smoothing but might not implement
# the strict K-Fold validation encoding by default; check documentation for CV strategies.
# For rigorous leakage prevention, manual K-Fold or integrating it with sklearn pipelines is often preferred.
```
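One common way to combine the library with leakage-aware evaluation is to place the encoder inside a scikit-learn `Pipeline`, so that during cross-validation it is fit only on each training fold and the validation fold's targets are never used to construct its own encoded values. This is a rough sketch assuming `category_encoders` is installed, reusing `df`, `feature`, and `target` from the K-Fold example above:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# The encoder is a pipeline step, so cross-validation fits it only on the
# training portion of each fold before scoring on the held-out fold.
pipeline = Pipeline([
    ('encode', ce.TargetEncoder(cols=[feature], smoothing=1.0)),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipeline, df[[feature]], df[target], cv=3)
print(scores)
```

Within each training fold the encoder still sees those rows' own targets when it fits, so for the strictest leakage control the manual K-Fold encoding shown earlier remains a useful complement.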
Advantages:

- Handles high-cardinality categorical features without expanding the feature space the way One-Hot Encoding does.
- Produces a single numerical feature that directly captures the relationship between each category and the target, which can carry significant predictive power.
Disadvantages:

- Highly prone to target leakage and overfitting if computed naively on the full dataset.
- Gives noisy, unreliable estimates for rare categories unless smoothing or a similar adjustment is applied.
- Adds workflow complexity: smoothing hyperparameters, cross-validation encoding, and handling of unseen categories.
Target encoding is a powerful technique, particularly useful for:

- High-cardinality nominal features, where One-Hot Encoding would create an impractically wide feature space.
- Problems where the average target value per category is itself a meaningful predictive signal.
Always apply target encoding after splitting your data into training and testing sets, and use techniques like cross-validation encoding and smoothing within your training pipeline to build reliable models.