Handling categorical features effectively is a significant challenge in many machine learning tasks, especially within boosting frameworks. Traditional methods often fall short. One-hot encoding, while straightforward, becomes computationally expensive and can lead to sparsity issues when dealing with high-cardinality features (features with many unique categories). Another common approach, standard target encoding (also known as mean encoding), replaces each category with the average target value observed for that category in the training data. However, this method suffers from a critical flaw: target leakage. By using the target value of a data point to compute the feature encoding for that same data point, we inadvertently introduce information about the target into the features, leading to overly optimistic performance estimates during training and poor generalization to unseen data.
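To see the leakage concretely, here is a minimal sketch (using pandas on a made-up toy table) of naive mean encoding, where each row's own label contributes to the value that encodes it:

```python
import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B"],
    "target": [ 1,   0,   1,   1,   0 ],
})

# Naive (leaky) target encoding: the group mean includes each row's own target
df["city_te_naive"] = df.groupby("city")["target"].transform("mean")
print(df)
# Row 0 is encoded as 0.5, a value partly computed from row 0's own label,
# so the feature leaks information about the target it is meant to predict.
```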
CatBoost introduces Ordered Target Statistics (Ordered TS) as an elegant and effective solution to this problem. The core idea is to calculate target statistics for each data point without using its own target value, thereby preventing direct leakage. This is achieved by imposing a specific order on the data.
Instead of using the entire dataset to calculate the target statistic for a given category, Ordered TS leverages an artificial "time" or order imposed on the training data. Here's how it works:

1. A random permutation $\sigma$ of the training samples is drawn, acting as an artificial timeline.
2. The samples are processed in that order. For the sample at position $k$, the statistic for a categorical feature is computed using only the preceding samples ($j < k$) that share the same category value.
3. A prior value is blended in for smoothing, so that early positions and rare categories still receive a sensible encoding.
The formula for the Ordered Target Statistic for sample k and feature i is typically calculated with smoothing:
$$\mathrm{TS}_{k,i} \;=\; \frac{\sum_{j=1}^{k-1}\left[x_{\sigma(j),i} = x_{\sigma(k),i}\right]\cdot y_{\sigma(j)} \;+\; \alpha\cdot \mathrm{prior}}{\sum_{j=1}^{k-1}\left[x_{\sigma(j),i} = x_{\sigma(k),i}\right] \;+\; \alpha}$$

Let's break down this formula:

- $\sigma$ is the random permutation, and $\sigma(k)$ is the sample at position $k$ in that order.
- $\left[x_{\sigma(j),i} = x_{\sigma(k),i}\right]$ is an indicator that equals 1 when the preceding sample $j$ has the same value of feature $i$ as sample $k$, and 0 otherwise.
- $y_{\sigma(j)}$ is the target value of that preceding sample.
- $\alpha > 0$ is a smoothing parameter, and $\mathrm{prior}$ is typically the average target value over the training data.
The term $\alpha \cdot \mathrm{prior}$ acts as a regularizer. It's particularly important early in the permutation (small $k$) or for rare categories, where there might be very few (or zero) preceding samples with the same category value. The smoothing prevents the encoded value from becoming too noisy or relying heavily on just one or two previous examples. It "pulls" the estimate towards the overall average target value when local information is scarce.
Dependency flow for calculating the Ordered Target Statistic for sample $x_{\sigma(k)}$. The calculation uses target values ($y$) only from preceding samples ($j < k$) in the permutation $\sigma$ that share the same category for feature $i$, combined with a prior value for smoothing. The target $y_{\sigma(k)}$ is not used.
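To make the mechanics concrete, here is a minimal sketch in plain NumPy of the formula above for a single random permutation. The function name, the default `alpha`, and the choice of the global target mean as the prior are illustrative assumptions, not CatBoost's internal implementation.

```python
import numpy as np

def ordered_target_statistic(categories, targets, alpha=1.0, seed=0):
    """Ordered TS for one categorical feature under a single random permutation.

    Illustrative sketch: for each sample (in permutation order), the encoding
    uses only the targets of *preceding* samples with the same category,
    smoothed towards a prior.
    """
    rng = np.random.default_rng(seed)
    n = len(categories)
    perm = rng.permutation(n)           # the artificial "time" order sigma
    prior = float(np.mean(targets))     # assumed prior: global target mean

    encoded = np.empty(n, dtype=float)
    sums, counts = {}, {}               # running per-category statistics
    for k in perm:                      # process samples in permutation order
        cat = categories[k]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # (sum of preceding targets + alpha * prior) / (count + alpha)
        encoded[k] = (s + alpha * prior) / (c + alpha)
        # Update the running statistics *after* encoding sample k, so its own
        # target never influences its own encoding.
        sums[cat] = s + targets[k]
        counts[cat] = c + 1
    return encoded

# Example usage on a tiny toy feature
cities = np.array(["A", "A", "B", "B", "B"])
y = np.array([1, 0, 1, 1, 0])
print(ordered_target_statistic(cities, y, alpha=1.0))
```

Note how the first sample of each category in the permutation receives exactly the prior, while later samples blend the prior with the running category mean.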
Relying on a single random permutation could potentially introduce its own biases depending on the specific order generated. To mitigate this, CatBoost typically generates multiple random permutations during the training of an ensemble. Different permutations are used when constructing different trees in the boosting model. This averaging effect across various orderings makes the resulting target statistics more stable and robust. This mechanism is closely linked to Ordered Boosting, which we will discuss in the next section, as it ensures that the model trained at each step doesn't suffer from prediction shift caused by the encoding process.
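As a rough illustration of this averaging idea (a simplification: CatBoost draws different permutations for different trees rather than averaging one static encoding), the statistic from the sketch above could be stabilized over several permutations:

```python
import numpy as np

def averaged_ordered_ts(categories, targets, alpha=1.0, n_permutations=5):
    """Average the Ordered TS over several random permutations (illustrative)."""
    encodings = [
        ordered_target_statistic(categories, targets, alpha=alpha, seed=s)
        for s in range(n_permutations)
    ]
    return np.mean(encodings, axis=0)
```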
When predicting on new, unseen data (the test set), the sequential scheme cannot be applied in the same way: test samples have no usable target values, and their encodings should not depend on the order in which they happen to arrive. Instead, CatBoost typically uses the target statistics learned during training. For a given category observed in a test sample, the encoding might be calculated using all training samples belonging to that category (often still incorporating the prior and smoothing) or by using pre-calculated statistics stored during the training phase. The exact method can depend on the CatBoost implementation details and parameters.
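A sketch of one such scheme, assuming the encoding for unseen data is built from the full training set with the same prior smoothing (the exact procedure inside CatBoost may differ):

```python
import numpy as np

def encode_test(train_categories, train_targets, test_categories, alpha=1.0):
    """Encode unseen categories using full-training-set statistics plus a prior."""
    prior = float(np.mean(train_targets))
    sums, counts = {}, {}
    for cat, y in zip(train_categories, train_targets):
        sums[cat] = sums.get(cat, 0.0) + y
        counts[cat] = counts.get(cat, 0) + 1
    # Categories never seen in training fall back to the prior (count = 0).
    return np.array([
        (sums.get(cat, 0.0) + alpha * prior) / (counts.get(cat, 0) + alpha)
        for cat in test_categories
    ])
```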
Ordered Target Statistics represent a substantial improvement in how gradient boosting models can learn from categorical data, forming a foundation for CatBoost's strong performance, particularly on datasets rich in such features.