Now that we've reviewed several techniques for transforming categorical data into numerical representations, let's put them into practice using Python. We'll use Pandas for data manipulation, and Scikit-learn along with the category_encoders library for applying various encoding strategies.
First, let's set up our environment by importing the necessary libraries and creating a sample dataset. This dataset contains different types of categorical features we commonly encounter.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder, BinaryEncoder, HashingEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Sample DataFrame
data = {
'color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green', 'Yellow'],
'size': ['M', 'L', 'S', 'M', 'L', 'S', 'M'],
'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East'],
'city': ['NYC', 'LA', 'Boston', 'SF', 'Chicago', 'Miami', 'Austin'],
'temperature': [25, 30, 22, 28, 20, 35, 26], # Example numerical feature
'outcome': [1, 0, 1, 0, 1, 0, 1] # Example target variable
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
One-Hot Encoding is suitable for nominal categorical features where no ordinal relationship exists. It creates a new binary column for each unique category. We can use either Pandas' get_dummies or Scikit-learn's OneHotEncoder. While get_dummies is convenient for quick exploration, OneHotEncoder integrates better with Scikit-learn pipelines, especially for handling unseen categories during testing.
Using Pandas get_dummies:
# Apply get_dummies to the 'color' column
df_one_hot_pandas = pd.get_dummies(df, columns=['color'], prefix='color', drop_first=False)
print("\nDataFrame after One-Hot Encoding (Pandas):")
print(df_one_hot_pandas)
Notice how the 'color' column is replaced by multiple color_* columns. Setting drop_first=True removes one category per feature, avoiding the multicollinearity that can affect linear models with an intercept.
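For example, a minimal sketch of the dropped-category variant (pandas sorts categories alphabetically, so 'Blue' is dropped here):
# Same encoding, but dropping the first category per feature
df_one_hot_dropped = pd.get_dummies(df, columns=['color'], prefix='color', drop_first=True)
print(df_one_hot_dropped.filter(like='color_'))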
Using Scikit-learn OneHotEncoder:
Using OneHotEncoder is often preferred within a machine learning workflow. It needs to be "fit" on the training data and then used to "transform" both training and test data.
# Initialize OneHotEncoder
# handle_unknown='ignore' prevents errors if unseen categories appear in test data
# (sparse_output was added in scikit-learn 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Fit and transform the 'color' column
# Note: Scikit-learn encoders expect a 2D array, hence df[['color']]
color_encoded = ohe.fit_transform(df[['color']])
# Create a DataFrame with new feature names
color_encoded_df = pd.DataFrame(color_encoded, columns=ohe.get_feature_names_out(['color']))
# Concatenate with the original DataFrame (dropping the original 'color' column)
df_one_hot_sklearn = pd.concat([df.drop('color', axis=1), color_encoded_df], axis=1)
print("\nDataFrame after One-Hot Encoding (Scikit-learn):")
print(df_one_hot_sklearn)
The result is similar, but using the Scikit-learn transformer allows consistent application across data splits. The main drawback of One-Hot Encoding is the potential for creating a very high number of features if the original category has many unique values (high cardinality).
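Because we set handle_unknown='ignore', the fitted encoder also copes gracefully with categories it never saw. A quick sketch (the value 'Purple' is hypothetical):
# 'Purple' was not present when the encoder was fit
unseen = pd.DataFrame({'color': ['Purple']})
print(ohe.transform(unseen))  # Encoded as an all-zero row instead of raising an error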
Ordinal Encoding is used when the categories have a meaningful order. We need to define this order explicitly.
# Define the explicit category order for the 'size' column
ordered_categories = [['S', 'M', 'L']]  # List of lists: one list per column for OrdinalEncoder
# Initialize OrdinalEncoder with the defined order
ordinal_encoder = OrdinalEncoder(categories=ordered_categories)
# Apply Ordinal Encoding to the 'size' column
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']])
print("\nDataFrame after Ordinal Encoding:")
# Display relevant columns
print(df[['size', 'size_encoded']])
# Drop the original 'size' column if proceeding
# df = df.drop('size', axis=1)
Here, 'S', 'M', and 'L' are mapped to 0, 1, and 2 respectively, preserving the inherent order. Applying this to nominal data (like 'color') would misleadingly imply a non-existent order and distance relationship between categories.
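If you prefer to stay in plain Pandas, the same explicit mapping can be applied with a dictionary and Series.map; a minimal equivalent sketch:
# Equivalent manual mapping; values absent from the dictionary become NaN
size_mapping = {'S': 0, 'M': 1, 'L': 2}
print(df['size'].map(size_mapping))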
Target Encoding replaces each category with the average value of the target variable for that category. It's particularly useful for high cardinality features but carries a risk of overfitting, especially if some categories are infrequent, so regularization or smoothing techniques are important. We'll use the TargetEncoder from the category_encoders library.
# Initialize TargetEncoder
# Smoothing helps prevent overfitting, especially for rare categories
target_encoder = TargetEncoder(cols=['region'], smoothing=1.0)
# Fit and transform the 'region' column using the 'outcome' target
df['region_target_encoded'] = target_encoder.fit_transform(df['region'], df['outcome'])
print("\nDataFrame after Target Encoding:")
# Display relevant columns
print(df[['region', 'outcome', 'region_target_encoded']])
# Drop the original 'region' column if proceeding
# df = df.drop('region', axis=1)
Observe how each region is now represented by a numerical value derived from the average 'outcome' for that region. For example, 'North' appears twice with outcome 1, so its encoded value is closer to 1 than 'South' which appears twice with outcome 0. Smoothing adjusts these values slightly towards the global mean, mitigating the influence of categories with few samples. Be mindful that this encoder uses information from the target variable, so careful validation (e.g., applying encoding learned only from the training set) is necessary to avoid target leakage.
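To build intuition, the unsmoothed encoding is just the per-category mean of the target, which we can compute directly. A leakage-safe pattern fits the encoder on training rows only and reuses it for the test split; a sketch with an arbitrary split:
# The unsmoothed target encoding is simply the per-category mean of 'outcome'
print(df.groupby('region')['outcome'].mean())
# Leakage-safe pattern: fit on training rows only, then transform both splits
train, test = df.iloc[:5], df.iloc[5:]  # arbitrary split, for illustration only
te = TargetEncoder(cols=['region'], smoothing=1.0)
te.fit(train[['region']], train['outcome'])
train_encoded = te.transform(train[['region']])
test_encoded = te.transform(test[['region']])  # uses statistics learned from train only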
Binary Encoding is a compromise between One-Hot Encoding and Target Encoding for high cardinality features. It first assigns an ordinal integer to each category and then converts these integers to binary code. Each position in the binary code becomes a separate feature column.
# Initialize BinaryEncoder
binary_encoder = BinaryEncoder(cols=['city'], return_df=True)
# Fit and transform the 'city' column
df_binary_encoded = binary_encoder.fit_transform(df)
print("\nDataFrame after Binary Encoding ('city'):")
print(df_binary_encoded)
Notice that the 'city' column (with 7 unique values) is replaced by city_0, city_1, and city_2. Since 2² < 7 ≤ 2³, we need 3 binary features. This is significantly fewer than the 7 features One-Hot Encoding would create. Binary encoding captures some uniqueness without the high dimensionality of One-Hot, but the resulting features lack direct interpretability.
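The number of binary columns follows from the cardinality; a quick sketch (exact counts can vary slightly by implementation, for instance if a code is reserved for unknown values):
import math
n_categories = df['city'].nunique()  # 7
n_bits = math.ceil(math.log2(n_categories + 1))  # 3; the +1 accounts for 1-based ordinals
print(f"{n_categories} categories -> {n_bits} binary features")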
The Hashing Encoder uses a hashing function to map categories (represented as strings) to a fixed number of output features. It's memory efficient and handles new categories naturally, making it suitable for very large or streaming datasets. The main drawback is potential information loss due to hash collisions (different categories mapping to the same hash value).
# Initialize HashingEncoder
# n_components specifies the desired number of output features (hash space size)
hashing_encoder = HashingEncoder(cols=['city'], n_components=4)
# Fit and transform the 'city' column
df_hashed = hashing_encoder.fit_transform(df)
print("\nDataFrame after Hashing Encoding ('city' with 4 components):")
print(df_hashed)
The 'city' column is replaced by 4 new hash features (col_0 to col_3). The number of components (n_components) is a hyperparameter; a larger value reduces collision probability but increases dimensionality. Hashing is computationally efficient but sacrifices interpretability due to collisions.
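The core idea can be sketched by hand: hash the category string and take the remainder modulo the number of output columns. This mirrors the concept only; the library's exact hashing scheme may differ:
import hashlib
def hash_bucket(value, n_components=4):
    # md5 keeps buckets stable across runs (Python's built-in hash() is salted per process)
    digest = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
    return digest % n_components
for city in df['city'].unique():
    print(city, '->', hash_bucket(city))  # a collision occurs when two cities share a bucket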
The choice of encoding method depends heavily on the specific dataset, the cardinality of the categorical features, and the machine learning model being used.
| Method | Pros | Cons | Best For |
|---|---|---|---|
| One-Hot | Preserves all information; No implied order; Works well with linear models | High dimensionality for high cardinality; Doesn't handle new categories well (by default) | Low cardinality nominal features |
| Ordinal | Simple; Preserves order | Assumes meaningful order; Can mislead models if order is arbitrary | Ordinal features |
| Target | Captures target relationship; Handles high cardinality better than One-Hot | Prone to overfitting; Risk of target leakage; Requires careful validation | High cardinality features; Tree models |
| Binary | Lower dimensionality than One-Hot for high cardinality | Information loss compared to One-Hot; Less interpretable | Medium-high cardinality nominal features |
| Hashing | Memory efficient; Handles new categories; Good for online learning | Information loss via collisions; Less interpretable; Sensitive to n_components | Very high cardinality; Streaming data |
Here's a visual comparison of the number of features generated for the 'city' column (7 unique values) by different methods:
[Figure: Comparison of feature dimensionality for different encoding techniques applied to the 'city' column.]
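You can also verify these counts directly from the objects created earlier; a quick check:
# Feature counts for 'city' (7 unique values) under each method
print('One-Hot:', df['city'].nunique())  # 7
print('Binary: ', df_binary_encoded.filter(like='city_').shape[1])  # 3
print('Hashing:', df_hashed.filter(like='col_').shape[1])  # 4 = n_components
print('Target: ', 1)  # a single numeric column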
In practice, you'll often apply different encoders to different columns. Scikit-learn's ColumnTransformer is perfect for this, allowing you to build preprocessing pipelines that handle various data types correctly and consistently.
# Recreate the original DataFrame for demonstration
df = pd.DataFrame(data)
# Define the mapping for ordinal encoding
size_categories = [['S', 'M', 'L']]
# Define preprocessing steps for different column types
preprocessor = ColumnTransformer(
transformers=[
('num', 'passthrough', ['temperature']), # Keep numerical features as is (or scale them)
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['color']),
('ordinal', OrdinalEncoder(categories=size_categories), ['size']),
('binary', BinaryEncoder(), ['region']), # Using Binary for 'region' as an example
('target', TargetEncoder(smoothing=1.0), ['city']) # Using Target for 'city' as an example
],
remainder='drop' # Drop columns not specified (like 'outcome' if it's just the target)
)
# Create a pipeline (optional, but good practice)
# You would typically add a model step after preprocessing
pipeline = Pipeline(steps=[('preprocess', preprocessor)])
# Fit and transform the data (excluding the target 'outcome')
# Note: TargetEncoder needs y during fit
X = df.drop('outcome', axis=1)
y = df['outcome']
encoded_data = pipeline.fit_transform(X, y) # Pass y for TargetEncoder
# Get feature names after transformation (requires Scikit-learn 1.1+ and a category_encoders
# release whose transformers implement get_feature_names_out; otherwise construct names manually)
feature_names = pipeline.named_steps['preprocess'].get_feature_names_out()
# Create a DataFrame with the encoded data and proper column names
df_encoded_pipeline = pd.DataFrame(encoded_data, columns=feature_names)
print("\nDataFrame after applying ColumnTransformer pipeline:")
print(df_encoded_pipeline.head())
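Although this example stops at preprocessing, attaching an estimator is a one-line change. A sketch using LogisticRegression as a placeholder model:
from sklearn.linear_model import LogisticRegression
# Full pipeline: preprocessing followed by a (placeholder) classifier
model_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
model_pipeline.fit(X, y)  # y flows through to the TargetEncoder step during fit
print(model_pipeline.predict(X))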
This practical application demonstrates how to select and apply various encoding techniques using standard Python libraries. Remember that the effectiveness of an encoding method often depends on the downstream machine learning model. Experimentation and validation are necessary to determine the best approach for your specific problem.