Before applying any encoding technique, it's important to understand that categorical features themselves come in different flavors. Broadly, we can classify them into two main types: nominal and ordinal. Recognizing this distinction is fundamental because the optimal encoding strategy often depends on the type of categorical data you're working with. Applying an inappropriate encoding method can either inject misleading information into your dataset or fail to capture valuable structural information present in the feature.
Nominal categories represent distinct groups or labels where no inherent order or ranking exists among the values. Think of them as qualitative classifications. Examples include colors ('Red', 'Green', 'Blue') and countries ('USA', 'Canada', 'Mexico').
In nominal features, any numerical assignment to the categories would be purely arbitrary. Assigning 1 to 'Red', 2 to 'Green', and 3 to 'Blue' doesn't imply that Blue is "greater than" Green, or that the difference between Red and Green is the same as between Green and Blue. Machine learning algorithms, particularly linear models or distance-based algorithms (like k-Nearest Neighbors), might misinterpret such numerical assignments, assuming an order or magnitude difference that simply isn't present in the original data. This can lead to incorrect assumptions and potentially degrade model performance.
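As a minimal sketch of why such codes are arbitrary, the snippet below compares two equally valid integer mappings for the same colors. The mappings and the helper `pairwise_gaps` are illustrative assumptions, not part of any library. The "distance" between two colors changes depending on which mapping was picked, which is exactly the spurious structure a distance-based model could pick up.

```python
# Two arbitrary but equally valid integer labelings of the same nominal feature.
colors = ["Red", "Green", "Blue"]
mapping_a = {"Red": 1, "Green": 2, "Blue": 3}
mapping_b = {"Red": 2, "Green": 3, "Blue": 1}


def pairwise_gaps(mapping):
    """Absolute difference between the integer codes of each pair of colors."""
    return {
        (a, b): abs(mapping[a] - mapping[b])
        for i, a in enumerate(colors)
        for b in colors[i + 1:]
    }


gaps_a = pairwise_gaps(mapping_a)
gaps_b = pairwise_gaps(mapping_b)

# Under mapping_a, Red and Blue look "far apart"; under mapping_b they look "close".
print(gaps_a[("Red", "Blue")], gaps_b[("Red", "Blue")])
```

Neither mapping is more correct than the other, so any pattern a model learns from these gaps reflects the labeling choice rather than the data.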
Ordinal categories, on the other hand, possess a meaningful order or ranking among their values, but the magnitude of difference between consecutive categories is not necessarily known, uniform, or quantifiable. The order matters, but arithmetic operations on the categories are generally meaningless. Examples include education levels ('High School', 'Bachelor's', 'Master's', 'PhD') and satisfaction ratings such as 'Neutral' and 'Satisfied'.
Here, we know that 'Master's' is higher than 'Bachelor's', and 'Satisfied' represents a better outcome than 'Neutral'. However, we usually cannot assume that the "gap" between 'High School' and 'Bachelor's' is the same as the gap between 'Master's' and 'PhD'. Assigning numerical values like 1, 2, 3, 4 to these levels captures the order, which can be valuable information for some models. However, one must still be cautious, as the model might interpret the numerical differences literally (e.g., assume the difference between level 1 and 2 is exactly the same as between level 3 and 4).
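One way to attach an explicit order to such a feature is sketched below using a pandas ordered categorical. The column name `education` and the sample rows are assumptions made up for illustration. The order is declared by hand from domain knowledge, and the resulting integer codes preserve that order but say nothing about how large each gap is.

```python
import pandas as pd

# The order comes from domain knowledge, not from the data itself.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Hypothetical sample data for illustration.
df = pd.DataFrame({"education": ["Master's", "High School", "PhD", "Bachelor's"]})

# Declare the column as an ordered categorical with the explicit level order.
df["education"] = pd.Categorical(
    df["education"], categories=education_order, ordered=True
)

# cat.codes is zero-based; add 1 to get the 1, 2, 3, 4 style codes used in the text.
df["education_code"] = df["education"].cat.codes + 1

# Order-aware comparisons work on the categorical itself.
above_bachelors = df["education"] > "Bachelor's"

print(df)
print(above_bachelors.tolist())
```

The codes are only ranks here: the step from 1 to 2 is not claimed to equal the step from 3 to 4.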
Understanding whether a feature is nominal or ordinal directly influences your choice of encoding strategy.
For example, encoding the nominal values ['USA', 'Canada', 'Mexico'] as [1, 2, 3] might lead a linear model to incorrectly infer that Mexico has three times some property represented by USA.
Algorithms also differ in how sensitive they are to such codes. Tree-based models split on thresholds (e.g., feature <= 2), so they are often less affected by the exact integer values. However, linear models, SVMs, and neural networks are more sensitive to the numerical scale and implied relationships between encoded values.
Identifying whether a categorical feature is nominal or ordinal guides the selection of appropriate encoding techniques.
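To make the contrast between a linear term and a threshold split concrete, here is a small sketch in plain Python. The coefficient `weight` and the helpers `linear_contribution` and `tree_split` are hypothetical stand-ins for what a fitted model would do, not calls to any real library.

```python
# Integer codes for a nominal feature, as in the example above.
country_codes = {"USA": 1, "Canada": 2, "Mexico": 3}

# Hypothetical coefficient a linear model might learn for this feature.
weight = 0.5


def linear_contribution(country):
    """A linear model multiplies the code by a weight, so the code's magnitude matters."""
    return weight * country_codes[country]


def tree_split(country, threshold=2):
    """A threshold split only asks which side of the cut the code falls on."""
    return "left" if country_codes[country] <= threshold else "right"


for country in country_codes:
    print(country, linear_contribution(country), tree_split(country))
```

In the linear term, Mexico contributes three times as much as USA purely because of the codes chosen. The split only groups USA and Canada on one side and Mexico on the other, although that grouping still depends on the arbitrary code order.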
Identifying the type often requires examining the unique values of the feature and applying domain knowledge about what those values represent. Is there a logical progression or hierarchy? If yes, it's likely ordinal; otherwise, it's nominal.
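A quick way to start this inspection is sketched below with pandas. The DataFrame contents and the `known_orders` dictionary are assumptions for illustration: the code only lists the unique values, and the nominal or ordinal decision is still supplied by a person who knows what the values mean.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "satisfaction": ["Neutral", "Satisfied", "Neutral", "Satisfied"],
})

# Step 1: look at the distinct values of each categorical column.
unique_values = {column: df[column].unique().tolist() for column in df.columns}
for column, values in unique_values.items():
    print(column, values)

# Step 2: record the orders that domain knowledge supports (a manual decision).
known_orders = {"satisfaction": ["Neutral", "Satisfied"]}

# Columns with a recorded order are treated as ordinal, the rest as nominal.
feature_types = {
    column: "ordinal" if column in known_orders else "nominal"
    for column in df.columns
}
print(feature_types)
```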
With this distinction in mind, the following sections will present specific encoding techniques, discussing which data type(s) they are best suited for and illustrating their implementation using common Python libraries. We'll start with methods typically used for nominal data before moving on to techniques that can handle ordinal relationships or address challenges like high cardinality.
© 2025 ApX Machine Learning