Having established the need to convert categorical features into numerical representations, let's examine how to implement these transformations using Scikit-learn's dedicated tools. Scikit-learn provides transformer classes, such as OneHotEncoder and OrdinalEncoder, which follow the standard fit and transform API pattern, ensuring consistency within the library.
One-Hot Encoding is suitable when the categorical features do not have an inherent order. It transforms each category into a new binary feature (0 or 1). Scikit-learn's OneHotEncoder is the primary tool for this task.
Let's consider a simple dataset with a categorical feature:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data
data = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green'],
'size': ['M', 'L', 'S', 'M']})
print("Original Data:")
print(data)
# Select the categorical column(s) to encode
categorical_features = ['color'] # Let's encode 'color' first
To apply One-Hot Encoding to the 'color' column:
1. Instantiate OneHotEncoder.
2. Call its fit_transform method on the selected data. This method first learns the unique categories present (fit) and then performs the transformation (transform).

# 1. Instantiate the encoder
# By default, it returns a sparse matrix. For demonstration, let's request a dense array.
# handle_unknown='ignore' prevents errors if new categories appear during transformation
encoder_color = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# 2. Fit and transform the 'color' column
# Encoders expect 2D input, so we use double brackets [['color']]
encoded_colors = encoder_color.fit_transform(data[['color']])
print("\nEncoded 'color' feature (One-Hot):")
print(encoded_colors)
# Get the names of the newly created features
feature_names_color = encoder_color.get_feature_names_out(['color'])
print("\nFeature Names:")
print(feature_names_color)
# Combine with original data (optional, for illustration)
encoded_df_color = pd.DataFrame(encoded_colors, columns=feature_names_color, index=data.index)
data_encoded = pd.concat([data.drop(columns=['color']), encoded_df_color], axis=1)
print("\nData with 'color' One-Hot Encoded:")
print(data_encoded)
Output:
Original Data:
color size
0 Red M
1 Green L
2 Blue S
3 Green M
Encoded 'color' feature (One-Hot):
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 1. 0.]]
Feature Names:
['color_Blue' 'color_Green' 'color_Red']
Data with 'color' One-Hot Encoded:
size color_Blue color_Green color_Red
0 M 0.0 0.0 1.0
1 L 0.0 1.0 0.0
2 S 1.0 0.0 0.0
3 M 0.0 1.0 0.0
Key Parameters for OneHotEncoder:
sparse_output: Defaults to True, returning a SciPy sparse matrix. Sparse matrices are memory-efficient when you have many categories (and thus many resulting columns with mostly zeros). Set to False to get a standard NumPy dense array, which might be easier to inspect initially but can consume significant memory for high-cardinality features. (In Scikit-learn versions before 1.2, this parameter was named sparse.)
handle_unknown: Controls behavior when transform encounters categories that were not seen during fit. The default, 'error', raises an error. Using 'ignore' assigns zeros to all the one-hot encoded columns for that unknown category, which is often a practical approach.
categories: By default ('auto'), categories are inferred from the training data. You can provide a specific list of categories if needed.
drop: Can be set to 'first', or to a specific category value for each feature, to drop one of the binary columns for each feature. This helps avoid multicollinearity in some linear models but can make interpretation less direct.

When categories have a meaningful order (e.g., 'low', 'medium', 'high'), OrdinalEncoder is more appropriate. It assigns a single integer value to each category based on its position in the specified order.
Let's use the 'size' column from our previous example, assuming an order S < M < L.
from sklearn.preprocessing import OrdinalEncoder
# Define the desired order for the 'size' feature
size_order = ['S', 'M', 'L']
# 1. Instantiate the encoder, providing the category order
# The 'categories' parameter expects a list of lists, one for each feature being encoded
encoder_size = OrdinalEncoder(categories=[size_order])
# 2. Fit and transform the 'size' column
# Again, use double brackets for 2D input
encoded_sizes = encoder_size.fit_transform(data[['size']])
print("\nEncoded 'size' feature (Ordinal):")
print(encoded_sizes)
# Add the ordinally encoded column back to the DataFrame
data['size_encoded'] = encoded_sizes.astype(int) # Convert float output to int
print("\nData with 'size' Ordinally Encoded:")
print(data)
Output:
Encoded 'size' feature (Ordinal):
[[1.]
[2.]
[0.]
[1.]]
Data with 'size' Ordinally Encoded:
color size size_encoded
0 Red M 1
1 Green L 2
2 Blue S 0
3 Green M 1
Notice how 'S' is mapped to 0, 'M' to 1, and 'L' to 2, respecting the order we provided.
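For contrast, here is a small sketch of what happens when no order is supplied. It rebuilds the 'size' column in a standalone frame (the name sizes is introduced here) and compares the inferred mapping with the explicit one. It also shows inverse_transform, which recovers the original labels from the codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['M', 'L', 'S', 'M']})

# No 'categories' argument: the order is inferred by sorting the unique values
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes[['size']])

# Explicit order S < M < L, as in the example above
ordered_encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
ordered_codes = ordered_encoder.fit_transform(sizes[['size']])

print(default_encoder.categories_[0])  # sorted order: L, M, S
print(default_codes.ravel())           # M -> 1, L -> 0, S -> 2
print(ordered_codes.ravel())           # M -> 1, L -> 2, S -> 0

# inverse_transform maps the integer codes back to the original labels
restored = ordered_encoder.inverse_transform(ordered_codes)
print(restored.ravel())
```

With the inferred order, 'L' receives the smallest code and 'S' the largest, which reverses the intended ranking of small and large sizes.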
Key Parameter for OrdinalEncoder:
categories: This is the most important parameter. It should be a list containing one list for each feature being encoded. Each inner list specifies the desired order of categories for that feature. If omitted, the encoder sorts the unique values it encounters (alphabetically for strings), which rarely matches the intended order; for the 'size' column it would map 'L' to 0, 'M' to 1, and 'S' to 2. Always specify categories explicitly when using OrdinalEncoder for features with inherent order.

These examples show how to apply encoders to individual columns or subsets of columns. In typical machine learning workflows, you often have multiple numerical and categorical features. Applying different preprocessing steps (like scaling for numerical features and encoding for categorical ones) to different columns is a common requirement. Scikit-learn's ColumnTransformer, often used within a Pipeline (covered in Chapter 6), is designed specifically for this scenario, allowing you to apply different transformers to different subsets of columns in a structured way.
Choosing the correct encoder (OneHotEncoder for nominal features, OrdinalEncoder for ordinal ones) and applying it correctly through Scikit-learn's transformer API are fundamental steps in preparing your data for effective model training. Remember that OneHotEncoder can significantly increase dimensionality, while OrdinalEncoder imposes an order that might not exist, or might not be captured correctly if the categories parameter is misused.
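To make the dimensionality point concrete, here is a rough sketch using a made-up high-cardinality feature. The id_ labels and the row and category counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical high-cardinality feature: 1,000 rows, 500 distinct labels
labels = np.array([f"id_{i % 500}" for i in range(1000)]).reshape(-1, 1)

one_hot = OneHotEncoder().fit_transform(labels)   # sparse matrix by default
ordinal = OrdinalEncoder().fit_transform(labels)  # dense array

print(one_hot.shape)  # (1000, 500): one column per distinct label
print(ordinal.shape)  # (1000, 1): a single integer-coded column
print(one_hot.nnz)    # 1000 stored non-zeros, one per row
```

The sparse default keeps the one-hot result compact in memory because only the single non-zero entry in each row is stored.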
© 2025 ApX Machine Learning