While Ordered Target Statistics provide a powerful way to represent individual categorical features numerically, the predictive signal in many datasets often arises from the interaction between different features. For instance, knowing a user's browser might be somewhat informative, and knowing their operating system might also be helpful, but knowing they use Browser='Safari' and OS='macOS' simultaneously could be significantly more predictive than either piece of information alone.
Manually engineering such interaction features (e.g., creating a new feature like browser_os) is possible but quickly becomes challenging. With many categorical features, the number of potential combinations (pairs, triplets, etc.) explodes combinatorially, making exhaustive creation computationally infeasible and requiring significant domain expertise to identify potentially useful interactions.
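To make that growth concrete, here is a minimal pandas sketch of manual pairwise interaction features. The column names (browser, os, country) and the data are purely illustrative placeholders, not from a real dataset.

```python
import pandas as pd
from itertools import combinations

# Toy data with a handful of categorical columns (illustrative only).
df = pd.DataFrame({
    "browser": ["Safari", "Chrome", "Safari", "Firefox"],
    "os":      ["macOS", "Windows", "iOS", "Linux"],
    "country": ["US", "DE", "US", "FR"],
})

cat_cols = ["browser", "os", "country"]

# Pairwise combinations alone grow as C(n, 2); adding triplets, quadruplets,
# and so on quickly makes exhaustive construction impractical.
for a, b in combinations(cat_cols, 2):
    df[f"{a}_{b}"] = df[a] + "_" + df[b]

print(df.columns.tolist())
```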
CatBoost addresses this challenge by automatically generating and evaluating combinations of categorical features during the tree building process. This is not done upfront by creating a massive new feature set; instead, it's integrated greedily into the split selection logic.
CatBoost constructs feature combinations on the fly. When building a tree and selecting the best split for a given node (or, in the case of Oblivious Trees, for a given level), CatBoost considers combinations involving the categorical features already chosen for splits at ancestor nodes of the current tree together with the candidate categorical feature being evaluated.
Essentially, for each new split based on a categorical feature C_new, CatBoost also evaluates splits based on combinations like (C_ancestor1, C_new), (C_ancestor2, C_new), (C_ancestor1, C_ancestor2, C_new), and so on, where each C_ancestor_i is a categorical feature used in an ancestor node's split.
These newly formed combination features are treated just like any other categorical feature. They are numerically encoded using the same Ordered Target Statistics (Ordered TS) mechanism described previously, respecting the "ordering principle" to prevent target leakage. The algorithm then evaluates the potential gain (reduction in loss) from splitting on these combinations alongside the original features and selects the best split overall.
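The snippet below is a simplified, illustrative sketch of that idea, not CatBoost's internal implementation: a combination value is formed from an ancestor feature and the candidate feature, then encoded with an ordered target statistic computed only from rows that appear earlier in a random permutation. The ordered_ts helper, the smoothing prior, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def ordered_ts(combo_values, target, prior=0.5, rng=None):
    """Encode a categorical column (here: tuple-valued combinations) with an
    ordered target statistic. Assumes a binary 0/1 target; `prior` smooths
    categories with little history."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(combo_values))
    sums, counts = {}, {}
    encoded = np.empty(len(combo_values))
    for idx in perm:
        key = combo_values[idx]
        s, c = sums.get(key, 0.0), counts.get(key, 0)
        encoded[idx] = (s + prior) / (c + 1)   # uses only "past" rows in the permutation
        sums[key] = s + target[idx]
        counts[key] = c + 1
    return encoded

# Combination of an ancestor split feature and a new candidate feature.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "US"],
    "browser": ["Safari", "Chrome", "Chrome", "Safari", "Safari"],
    "y":       [1, 0, 0, 1, 1],
})
combo = list(zip(df["country"], df["browser"]))   # (C_ancestor, C_new) pairs
df["combo_ts"] = ordered_ts(combo, df["y"].to_numpy(), rng=0)
print(df)
```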
This process integrates naturally with CatBoost's use of Oblivious Trees. Recall that Oblivious Trees use the same splitting criterion (feature and threshold/category) for all nodes at the same depth level. When CatBoost decides to use a feature combination for a split at a certain level, that specific combination (e.g., Country combined with Browser) is used across all nodes at that depth.
This means that combinations are built progressively as the tree grows deeper. A combination used at level 3 might involve categorical features used for splits at levels 1 and 2.
Example of how feature combinations are considered. When deciding the split for Level 2 nodes, CatBoost evaluates splitting on CatFeat_B directly, but also considers splitting on a derived feature representing the combination of the ancestor split (CatFeat_A) and CatFeat_B.
CatBoost generates combinations greedily and doesn't explore all possibilities exhaustively. It typically starts by considering combinations of two categorical features. If a two-feature combination proves effective (i.e., it's selected for a split), the algorithm might then consider combining it with another categorical feature to create a three-feature combination for splits further down the tree, and so on.
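Conceptually, the candidate set at each level consists of the individual categorical features plus extensions of combinations already chosen higher in the tree. The helper below is a rough conceptual sketch of that enumeration, not CatBoost's actual code; the max_complexity argument is an illustrative stand-in for the complexity limit discussed next.

```python
def candidate_combinations(used_combos, cat_features, max_complexity=2):
    """Enumerate split candidates: single categorical features plus greedy
    extensions of combinations already used at ancestor levels."""
    candidates = [(f,) for f in cat_features]            # single features
    for combo in used_combos:
        for f in cat_features:
            if f not in combo and len(combo) + 1 <= max_complexity:
                candidates.append(combo + (f,))           # extend by one feature
    return candidates

# Suppose level 1 split on CatFeat_A; level 2 candidates then include
# ("CatFeat_A", "CatFeat_B") and ("CatFeat_A", "CatFeat_C").
print(candidate_combinations([("CatFeat_A",)],
                             ["CatFeat_A", "CatFeat_B", "CatFeat_C"]))
```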
The maximum number of categorical features included in these automatic combinations can be controlled via the max_ctr_complexity parameter in the CatBoost library. The default usually considers combinations of up to two features (original feature + one combined feature). Increasing this allows the model to potentially capture higher-order interactions, but at the cost of increased computation and potential overfitting if not carefully tuned. Another related parameter, ctr_max_border_count, controls the number of splits considered for combinations involving numerical features.
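As a usage example, the sketch below trains a classifier with a raised max_ctr_complexity on a small synthetic dataset. Apart from max_ctr_complexity and cat_features, everything here (column names, data, and the other hyperparameters) is placeholder material chosen for illustration.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Synthetic data where the signal lives in the browser/os interaction.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "browser": rng.choice(["Safari", "Chrome", "Firefox"], size=200),
    "os":      rng.choice(["macOS", "Windows", "Linux"], size=200),
})
y = ((X["browser"] == "Safari") & (X["os"] == "macOS")).astype(int)

train_pool = Pool(X, y, cat_features=["browser", "os"])

model = CatBoostClassifier(
    iterations=50,
    depth=4,
    max_ctr_complexity=3,   # allow automatic combinations of up to 3 categorical features
    verbose=False,
)
model.fit(train_pool)
```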
By automatically handling feature combinations, CatBoost further enhances its ability to extract predictive power from datasets with numerous categorical features, making it a particularly strong choice for such problems. This capability, combined with its robust categorical encoding methods, distinguishes it from other gradient boosting implementations.