While One-Hot Encoding is effective for nominal categories (where order doesn't matter), applying it to ordinal data (categories with a meaningful sequence) discards potentially valuable ranking information. Consider features like customer satisfaction ('Low', 'Medium', 'High') or clothing sizes ('S', 'M', 'L', 'XL'). The order here is intrinsic and might correlate with the target variable you're trying to predict. Ordinal Encoding aims to preserve this sequence by mapping each category to a distinct integer.
Ordinal Encoding works by assigning a numerical rank, typically starting from 0 or 1, to each unique category based on its position in the predefined order. For instance, if we have education levels 'High School', 'Bachelor's', 'Master's', 'PhD', we could map them, starting from 0, as 'High School': 0, 'Bachelor's': 1, 'Master's': 2, 'PhD': 3.
This numerical representation directly reflects the inherent progression in the categories.
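The position-based rule can be sketched in a few lines: given an ordered list, each category's index becomes its code. The name education_order and the zero-based start are illustrative assumptions, not part of any library.

```python
# Ordered from lowest to highest level; the order is supplied by you, not inferred.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Each category's rank is its position in the list (zero-based here).
education_mapping = {level: rank for rank, level in enumerate(education_order)}

print(education_mapping)
```

Passing start=1 to enumerate would produce ranks 1 through 4 instead.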
You can implement Ordinal Encoding manually using Pandas or leverage Scikit-learn's dedicated transformer.
If you know the specific order of your categories, defining a custom mapping dictionary and using Pandas' map function is straightforward.
import pandas as pd
# Sample data with an ordinal feature
data = {'ID': [1, 2, 3, 4, 5],
        'Satisfaction': ['Medium', 'Low', 'High', 'Medium', 'Low']}
df = pd.DataFrame(data)
# Define the desired order and mapping
satisfaction_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
# Apply the mapping to the 'Satisfaction' column
df['Satisfaction_Encoded'] = df['Satisfaction'].map(satisfaction_mapping)
print(df)
This code produces:
   ID Satisfaction  Satisfaction_Encoded
0   1       Medium                     1
1   2          Low                     0
2   3         High                     2
3   4       Medium                     1
4   5          Low                     0
This approach gives you full control over the assignment but requires manual definition for each ordinal feature.
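One caveat with map is that any value missing from the dictionary silently becomes NaN instead of raising an error. A minimal sketch of a guard, assuming a hypothetical misspelled value 'Hgih' in the data:

```python
import pandas as pd

satisfaction_mapping = {'Low': 0, 'Medium': 1, 'High': 2}

# 'Hgih' is a deliberately misspelled, hypothetical value not in the mapping
raw = pd.Series(['Medium', 'Low', 'Hgih'], name='Satisfaction')

encoded = raw.map(satisfaction_mapping)

# Values absent from the dictionary come back as NaN rather than raising
unmapped = raw[encoded.isna()].unique().tolist()
if unmapped:
    print(f"Unmapped categories: {unmapped}")
```

Checking for unmapped values right after encoding makes typos and unexpected categories visible before they reach a model.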
OrdinalEncoder
Scikit-learn provides the OrdinalEncoder class within its preprocessing module. It can infer the categories from the data or accept a predefined order. For ordinal data, you should always specify the order via the categories parameter so that the encoding reflects the true sequence. By default the encoder sorts each column's unique values (alphabetically for strings), which rarely matches the intended ranking.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Sample data
data = {'ID': [1, 2, 3, 4, 5],
        'Satisfaction': ['Medium', 'Low', 'High', 'Medium', 'Low'],
        'Size': ['M', 'S', 'XL', 'L', 'M']}
df = pd.DataFrame(data)
# Define the specific order for each ordinal column
satisfaction_order = ['Low', 'Medium', 'High']
size_order = ['S', 'M', 'L', 'XL']
# Initialize the encoder with the defined orders
# Note: categories is a list of lists, one inner list per feature
encoder = OrdinalEncoder(categories=[satisfaction_order, size_order])
# Apply the encoder
# Select the columns to encode in the correct order matching 'categories'
encoded_features = encoder.fit_transform(df[['Satisfaction', 'Size']])
# Create new columns in the DataFrame for the encoded features
df['Satisfaction_Encoded'] = encoded_features[:, 0]
df['Size_Encoded'] = encoded_features[:, 1]
print(df)
The output shows the encoded features:
   ID Satisfaction Size  Satisfaction_Encoded  Size_Encoded
0   1       Medium    M                   1.0           1.0
1   2          Low    S                   0.0           0.0
2   3         High   XL                   2.0           3.0
3   4       Medium    L                   1.0           2.0
4   5          Low    M                   0.0           1.0
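A fitted encoder keeps the order it used in its categories_ attribute, and inverse_transform maps codes back to labels. The sketch below uses the same orders on a smaller, made-up frame. Passing dtype=int is an optional choice here, to get integer codes instead of the default floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small illustrative frame (a subset of the sample data)
df = pd.DataFrame({'Satisfaction': ['Medium', 'Low', 'High'],
                   'Size': ['M', 'S', 'XL']})

encoder = OrdinalEncoder(
    categories=[['Low', 'Medium', 'High'], ['S', 'M', 'L', 'XL']],
    dtype=int,  # optional: integer codes instead of the default float64
)
codes = encoder.fit_transform(df)

# The order used for each column (here, exactly what was supplied)
print(encoder.categories_)

# Map the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(restored)
```

Inspecting categories_ after fitting is a quick way to confirm the encoder is using the order you intended.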
If you omit the categories parameter, OrdinalEncoder falls back to its default (categories='auto'), which sorts the unique values found in each column. For strings this means alphabetical order, which would destroy the ordinal relationship (e.g., assigning 'High': 0, 'Low': 1, 'Medium': 2). Always define the order explicitly for meaningful ordinal encoding.
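To see what happens without an explicit order, one can fit an encoder with no categories argument and inspect what it inferred. A minimal sketch on the same Satisfaction values:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Satisfaction': ['Medium', 'Low', 'High', 'Medium', 'Low']})

# No categories argument: the encoder infers them from the data
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(df[['Satisfaction']])

# Inferred categories are sorted, so 'High' comes before 'Low' and 'Medium'
print(default_encoder.categories_[0])
print(default_codes.ravel())
```

Here 'Medium' receives the largest code and 'High' the smallest, so the integers no longer follow the Low < Medium < High ranking.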
The bar chart visually represents the mapping defined for the 'Satisfaction' feature, clearly showing the numerical assignment based on the specified order.
Advantages:
Disadvantages:
Ordinal Encoding is most appropriate when:
Ordinal Encoding can be useful for linear models if the numerical mapping happens to align well with the target variable's relationship, but be cautious about the equal-spacing assumption: integer codes imply that every step between adjacent categories is the same size, which the categories themselves do not guarantee. For the same reason it is often less suitable for distance-based algorithms unless the numerical mapping truly reflects a meaningful distance between categories. Always consider the nature of your data and the requirements of your chosen machine learning algorithm when deciding if Ordinal Encoding is the right choice.
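The spacing concern can be made concrete. With codes 0, 1, 2, the gap from 'Low' to 'Medium' equals the gap from 'Medium' to 'High', whether or not that matches reality. If domain knowledge suggests otherwise, one option is a custom mapping with uneven values. The numbers in custom_mapping below are made up purely for illustration and are not derived from any data.

```python
import pandas as pd

satisfaction = pd.Series(['Low', 'Medium', 'High'])

# Standard ordinal codes: adjacent categories are always one unit apart
ordinal_mapping = {'Low': 0, 'Medium': 1, 'High': 2}

# Hypothetical uneven spacing (illustrative values only)
custom_mapping = {'Low': 0.0, 'Medium': 1.0, 'High': 4.0}

# Differences between consecutive levels under each mapping
ordinal_gaps = satisfaction.map(ordinal_mapping).diff().dropna().tolist()
custom_gaps = satisfaction.map(custom_mapping).diff().dropna().tolist()

print(ordinal_gaps)
print(custom_gaps)
```

Both mappings preserve the order. They differ only in how far apart a model that uses the raw numbers will treat the categories.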
© 2025 ApX Machine Learning