While Ordered Target Statistics provide a powerful way to represent individual categorical features numerically, the predictive signal in many datasets often arises from the interaction between different features. For instance, knowing a user's browser might be somewhat informative, and knowing their operating system might also be helpful, but knowing they use Browser='Safari' and OS='macOS' simultaneously could be significantly more predictive than either piece of information alone.
Manually engineering such interaction features (e.g., creating a new feature like browser_os) is possible but quickly becomes challenging. With many categorical features, the number of potential combinations (pairs, triplets, etc.) explodes combinatorially, making exhaustive creation computationally infeasible and requiring significant domain expertise to identify potentially useful interactions.
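To make that growth concrete, here is a minimal pandas sketch of manual pairwise interaction features. The column names (browser, os, country) and the data are purely illustrative placeholders, not from a real dataset.

```python
import pandas as pd
from itertools import combinations

# Toy data with a handful of categorical columns (illustrative only).
df = pd.DataFrame({
    "browser": ["Safari", "Chrome", "Safari", "Firefox"],
    "os":      ["macOS", "Windows", "iOS", "Linux"],
    "country": ["US", "DE", "US", "FR"],
})

cat_cols = ["browser", "os", "country"]

# Pairwise combinations alone grow as C(n, 2); adding triplets, quadruplets,
# and so on quickly makes exhaustive construction impractical.
for a, b in combinations(cat_cols, 2):
    df[f"{a}_{b}"] = df[a] + "_" + df[b]

print(df.columns.tolist())
```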
CatBoost addresses this challenge by automatically generating and evaluating combinations of categorical features during the tree building process. This is not done upfront by creating a massive new feature set; instead, it's integrated greedily into the split selection logic.
CatBoost constructs feature combinations on the fly. When building a tree and selecting the best split for a given node (or, in the case of Oblivious Trees, for a given level), CatBoost considers combinations involving the categorical features already chosen for splits at ancestor nodes of the current tree together with the candidate categorical feature being evaluated.
Essentially, for each new split based on a categorical feature C_new, CatBoost also evaluates splits based on combinations like (C_ancestor1, C_new), (C_ancestor2, C_new), (C_ancestor1, C_ancestor2, C_new), and so on, where each C_ancestor_i is a categorical feature used in an ancestor node's split.
These newly formed combination features are treated just like any other categorical feature. They are numerically encoded using the same Ordered Target Statistics (Ordered TS) mechanism described previously, respecting the "ordering principle" to prevent target leakage. The algorithm then evaluates the potential gain (reduction in loss) from splitting on these combinations alongside the original features and selects the best split overall.
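The snippet below is a simplified, illustrative sketch of that idea, not CatBoost's internal implementation: a combination value is formed from an ancestor feature and the candidate feature, then encoded with an ordered target statistic computed only from rows that appear earlier in a random permutation. The ordered_ts helper, the smoothing prior, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def ordered_ts(combo_values, target, prior=0.5, rng=None):
    """Encode a categorical column (here: tuple-valued combinations) with an
    ordered target statistic. Assumes a binary 0/1 target; `prior` smooths
    categories with little history."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(combo_values))
    sums, counts = {}, {}
    encoded = np.empty(len(combo_values))
    for idx in perm:
        key = combo_values[idx]
        s, c = sums.get(key, 0.0), counts.get(key, 0)
        encoded[idx] = (s + prior) / (c + 1)   # uses only "past" rows in the permutation
        sums[key] = s + target[idx]
        counts[key] = c + 1
    return encoded

# Combination of an ancestor split feature and a new candidate feature.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "US"],
    "browser": ["Safari", "Chrome", "Chrome", "Safari", "Safari"],
    "y":       [1, 0, 0, 1, 1],
})
combo = list(zip(df["country"], df["browser"]))   # (C_ancestor, C_new) pairs
df["combo_ts"] = ordered_ts(combo, df["y"].to_numpy(), rng=0)
print(df)
```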
This process integrates naturally with CatBoost's use of Oblivious Trees. Recall that Oblivious Trees use the same splitting criterion (feature and threshold/category) for all nodes at the same depth level. When CatBoost decides to use a feature combination for a split at a certain level, that specific combination (e.g., Country combined with Browser) is used across all nodes at that depth.
This means that combinations are built progressively as the tree grows deeper. A combination used at level 3 might involve categorical features used for splits at levels 1 and 2.
Example of how feature combinations are considered. When deciding the split for Level 2 nodes, CatBoost evaluates splitting on CatFeat_B directly, but also considers splitting on a derived feature representing the combination of the ancestor split (CatFeat_A) and CatFeat_B.
CatBoost generates combinations greedily and doesn't explore all possibilities exhaustively. It typically starts by considering combinations of two categorical features. If a two-feature combination proves effective (i.e., it's selected for a split), the algorithm might then consider combining it with another categorical feature to create a three-feature combination for splits further down the tree, and so on.
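Conceptually, the candidate set at each level consists of the individual categorical features plus extensions of combinations already chosen higher in the tree. The helper below is a rough conceptual sketch of that enumeration, not CatBoost's actual code; the max_complexity argument is an illustrative stand-in for the complexity limit discussed next.

```python
def candidate_combinations(used_combos, cat_features, max_complexity=2):
    """Enumerate split candidates: single categorical features plus greedy
    extensions of combinations already used at ancestor levels."""
    candidates = [(f,) for f in cat_features]            # single features
    for combo in used_combos:
        for f in cat_features:
            if f not in combo and len(combo) + 1 <= max_complexity:
                candidates.append(combo + (f,))           # extend by one feature
    return candidates

# Suppose level 1 split on CatFeat_A; level 2 candidates then include
# ("CatFeat_A", "CatFeat_B") and ("CatFeat_A", "CatFeat_C").
print(candidate_combinations([("CatFeat_A",)],
                             ["CatFeat_A", "CatFeat_B", "CatFeat_C"]))
```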
The maximum number of categorical features included in these automatic combinations can be controlled via the max_ctr_complexity parameter in the CatBoost library. The default usually considers combinations of up to two features (original feature + one combined feature). Increasing this allows the model to potentially capture higher-order interactions, but at the cost of increased computation and potential overfitting if not carefully tuned. Another related parameter, ctr_max_border_count, controls the number of splits considered for combinations involving numerical features.
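As a usage example, the sketch below trains a classifier with a raised max_ctr_complexity on a small synthetic dataset. Apart from max_ctr_complexity and cat_features, everything here (column names, data, and the other hyperparameters) is placeholder material chosen for illustration.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Synthetic data where the signal lives in the browser/os interaction.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "browser": rng.choice(["Safari", "Chrome", "Firefox"], size=200),
    "os":      rng.choice(["macOS", "Windows", "Linux"], size=200),
})
y = ((X["browser"] == "Safari") & (X["os"] == "macOS")).astype(int)

train_pool = Pool(X, y, cat_features=["browser", "os"])

model = CatBoostClassifier(
    iterations=50,
    depth=4,
    max_ctr_complexity=3,   # allow automatic combinations of up to 3 categorical features
    verbose=False,
)
model.fit(train_pool)
```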
By automatically handling feature combinations, CatBoost further enhances its ability to extract predictive power from datasets with numerous categorical features, making it a particularly strong choice for such problems. This capability, combined with its robust categorical encoding methods, distinguishes it from other gradient boosting implementations.