In the previous section, we saw how Ordered Target Statistics (Ordered TS) provides a principled way to encode categorical features: each example's encoding is computed only from the targets of the examples that precede it, its observed "history" within a permutation of the data. This technique significantly reduces the risk of target leakage compared to naive target encoding methods. However, a subtle issue known as prediction shift can still arise during the standard gradient boosting training process.
Prediction shift occurs because the gradient (or residual) calculated for a training example at iteration $m$ depends on the model $F_{m-1}$ built over all preceding iterations. If target statistics were used to build the trees in $F_{m-1}$, the model has implicitly absorbed information about the current example's target, even if Ordered TS was used for the encoding itself. In other words, the model update step can re-introduce bias: when the residual for example $x_i$ is computed, $F_{m-1}$ has already been influenced by $y_i$ through the target statistics used in previous boosting steps.
CatBoost introduces Ordered Boosting to directly address this prediction shift. Instead of training a single sequence of models on the original dataset, Ordered Boosting leverages random permutations of the training data.
Consider the standard boosting process, where the model $F_m$ is built sequentially:

$$F_m(x) = F_{m-1}(x) + \alpha \cdot h_m(x)$$

Here, $h_m(x)$ is the tree trained at step $m$, typically fitted to the negative gradient (residuals) $r_{m-1}$ calculated using $F_{m-1}$. The issue arises when $r_{m-1}$ for example $x_i$ is calculated using $F_{m-1}$, a model which might have been influenced by $y_i$ via target encoding in steps $1, \dots, m-1$.
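To make the leakage path concrete, here is a minimal sketch of a plain boosting loop with squared loss. The function names and the use of scikit-learn's DecisionTreeRegressor as the weak learner are illustrative assumptions rather than CatBoost internals; the point is that the residual for each $x_i$ comes from predictions that were already fitted using $y_i$.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_plain_boosting(X, y, n_rounds=50, lr=0.1, max_depth=3):
    """Plain gradient boosting with squared loss. Note that the residuals for
    x_i are computed from predictions already fitted using y_i itself."""
    f0 = float(np.mean(y))                  # F_0: constant initial model
    F = np.full(len(y), f0)                 # current predictions F_{m-1}(x_i)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                   # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += lr * tree.predict(X)           # F_m = F_{m-1} + alpha * h_m
        trees.append(tree)
    return f0, trees

def predict_plain(X, f0, trees, lr=0.1):
    return f0 + lr * sum(tree.predict(X) for tree in trees)
```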
Ordered Boosting modifies this process. In outline, it works as follows:

1. Draw a set of $S$ random permutations of the training examples.
2. For each permutation $\sigma$, conceptually maintain a sequence of supporting models $M_1, \dots, M_n$, where $M_j$ is learned using only the first $j$ examples in $\sigma$.
3. To compute the residual for the example at position $i$ of $\sigma$, use the prediction of $M_{i-1}$, a model that has never seen that example's target value.
4. Fit the tree $h_m$ to these leakage-free residuals and update the supporting models accordingly.
This procedure ensures that when calculating the gradient estimate for a given example, the model used for this calculation has not been influenced by the target value of that same example. It mimics the situation during inference, where the target is unknown.
Let's visualize this with a simplified view for a single permutation $\sigma$.

Simplified view of Ordered Boosting for one permutation $\sigma$ at boosting step $m$. The model used to calculate the residual for example $x_{\sigma(i)}$ is built using only the examples preceding it in the permutation ($x_{\sigma(1)}$ to $x_{\sigma(i-1)}$).
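A deliberately naive sketch of one ordered-boosting step for a single permutation is shown below. The function name, the min_prefix cutoff, and fitting a fresh tree at every position are simplifications assumed purely for illustration; CatBoost shares structure across positions and uses several permutations to make the idea tractable.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_boosting_step(X, y, F, sigma, lr=0.1, max_depth=3, min_prefix=20):
    """One conceptual ordered-boosting step: the residual model for the example
    at position i of the permutation sees only the examples before position i."""
    for pos in range(len(sigma)):
        i = sigma[pos]
        prefix = sigma[:pos]                        # examples preceding sigma[pos]
        if len(prefix) < min_prefix:
            continue                                # too little history; skip update
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[prefix], y[prefix] - F[prefix])  # residuals of preceding examples only
        F[i] += lr * tree.predict(X[i:i + 1])[0]    # leakage-free update for x_sigma(i)
    return F
```

Calling this repeatedly, with something like sigma = np.random.permutation(len(y)) and F initialized to the mean of y, mimics boosting under the ordered scheme, but at the cost of many tree fits per round; avoiding that overhead is exactly what CatBoost's implementation is designed for.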
Ordered Boosting complements Ordered TS. Ordered TS ensures the feature encoding for an example $x_i$ uses target information only from preceding examples in a permutation. Ordered Boosting ensures the gradient calculation for $x_i$ uses a model trained only on preceding examples in that permutation. Together, they provide a robust defense against target leakage contaminating both the feature representation and the model update steps.
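As a rough illustration of that shared discipline, the sketch below computes an Ordered-TS-style encoding for a single categorical column by walking the same kind of permutation; the smoothing prior and its weight are assumed values for illustration, not CatBoost's exact formula.

```python
import numpy as np

def ordered_target_statistic(cat_values, y, sigma, prior=0.5, prior_weight=1.0):
    """Encode one categorical column so that the statistic for the example at
    each permutation position uses only the targets of preceding examples."""
    enc = np.empty(len(y), dtype=float)
    counts, sums = {}, {}
    for i in sigma:                          # walk the permutation once
        c = cat_values[i]
        n = counts.get(c, 0)
        s = sums.get(c, 0.0)
        enc[i] = (s + prior_weight * prior) / (n + prior_weight)
        counts[c] = n + 1                    # y[i] joins the running history only
        sums[c] = s + y[i]                   # after its own encoding has been fixed
    return enc
```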
While maintaining $S$ separate models seems computationally intensive, CatBoost employs efficient implementation techniques, particularly leveraging the structure of its Oblivious Trees (discussed in the "Oblivious Trees" section), to make Ordered Boosting practical.
The primary benefit of Ordered Boosting is a significant reduction in prediction shift, leading to models that generalize better from training to unseen data. This is especially impactful when dealing with datasets containing strong categorical predictors where target leakage can easily lead to inflated performance metrics during training that do not translate to real-world performance. The resulting models are generally more reliable.
The main trade-off is potentially increased training time compared to standard gradient boosting implementations that do not employ such a mechanism. However, CatBoost's overall optimizations, including GPU acceleration and efficient handling of categorical data, often make it competitive or even faster in practice, especially on datasets where its specialized features are advantageous.
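In practice, the scheme can be switched on directly in the catboost Python package through the boosting_type parameter. The configuration below is only an illustrative sketch; the hyperparameter values and the hypothetical X_train, y_train, and column names are assumptions.

```python
from catboost import CatBoostClassifier

# "Ordered" selects ordered boosting; "Plain" is the standard scheme.
model = CatBoostClassifier(
    boosting_type="Ordered",
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=False,
)

# Hypothetical training call: cat_features marks the columns that should be
# treated as categorical, so Ordered TS is applied to them during training.
# model.fit(X_train, y_train, cat_features=["city", "device_type"])
```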
By tackling prediction shift head-on, Ordered Boosting, in conjunction with Ordered TS, represents a substantial advancement in how gradient boosting algorithms handle categorical data, contributing considerably to CatBoost's reputation for accuracy and robustness out-of-the-box.