Having recapped ensemble methods in general, let's examine the specific type of model overwhelmingly used as the building block within gradient boosting frameworks: the decision tree. While other models could theoretically serve as base learners, decision trees, particularly Classification and Regression Trees (CART), offer a compelling combination of properties that make them exceptionally well-suited for the additive, error-correcting nature of boosting.
Decision trees partition the feature space into rectangular regions and assign a constant prediction value within each region (specifically, within each leaf node). This inherent structure provides several advantages in the context of boosting:
Capturing Non-linearities and Interactions: Unlike linear models, trees naturally model non-linear relationships between features and the target variable. Furthermore, the sequential nature of splits allows trees to capture complex interaction effects. For example, a split on feature $A$ followed by a split on feature $B$ effectively isolates a region defined by conditions on both $A$ and $B$, implicitly modeling their interaction. Boosting leverages this capability by sequentially adding trees that refine predictions in different regions of the feature space, capturing intricate patterns.
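As a minimal sketch (synthetic data and scikit-learn, used purely for illustration), the following compares a depth-2 tree against a linear model on a target that depends on two features only through their interaction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic target that depends on features A and B only through their
# interaction: y = 1 exactly when both A > 0.5 and B > 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 2))          # columns: A, B
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(float)

# A depth-2 tree isolates the region A > 0.5 AND B > 0.5 with two splits.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# A linear model on the raw features cannot represent this interaction.
linear = LinearRegression().fit(X, y)

print("tree   R^2:", tree.score(X, y))    # close to 1.0
print("linear R^2:", linear.score(X, y))  # noticeably lower
```

The shallow tree isolates the region where both conditions hold, while the linear model can only capture each feature's marginal effect.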
Handling Diverse Data Types: Standard CART algorithms can handle both numerical (continuous or discrete) and categorical features natively during the splitting process, although some boosting library implementations might require preprocessing like one-hot encoding for categorical variables initially. More advanced libraries like LightGBM and CatBoost incorporate specialized, highly efficient methods for handling categorical features directly, a topic we will cover in detail later.
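The two approaches look roughly like the sketch below (the toy data and parameter choices are illustrative; the LightGBM lines assume the lightgbm package is installed and are shown only in outline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative dataset with one categorical and one numerical feature.
df = pd.DataFrame({
    "city": ["london", "paris", "tokyo", "paris"],
    "size": [42.0, 55.5, 31.0, 60.2],
})
y = [1.2, 2.3, 0.9, 2.5]

# Approach 1: one-hot encode the categorical column before a generic
# gradient boosting model that expects numeric input.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, GradientBoostingRegressor()).fit(df, y)

# Approach 2 (outline, assumes lightgbm is installed): LightGBM consumes
# pandas 'category' columns natively, without one-hot encoding.
# import lightgbm as lgb
# df["city"] = df["city"].astype("category")
# lgb.LGBMRegressor().fit(df, y)
```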
Computational Considerations: Finding optimal splits in trees involves scanning features and potential split points. While computationally intensive for deep trees on large datasets, algorithms have been developed (exact greedy, approximate methods, histogram-based approaches) that make training reasonably efficient, especially for the relatively simple trees used in boosting. Split finding also parallelizes across features to a useful extent, which modern libraries exploit.
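The sketch below illustrates the exact greedy strategy on a single numeric feature: sort the values, scan the candidate thresholds, and keep the one that minimizes the children's squared error. It recomputes the child means at every step for clarity; real implementations maintain running sums or pre-bucket values into histograms to avoid that cost. The data here is synthetic.

```python
import numpy as np

def best_split(x, y):
    """Exact greedy search on one numeric feature: return the threshold
    that minimizes the total squared error of the two child nodes."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]

    best_sse, best_threshold = np.inf, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint
    return best_threshold, best_sse

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = np.where(x > 6.0, 5.0, 1.0) + rng.normal(0.0, 0.1, 200)
print(best_split(x, y))  # threshold should land near 6.0
```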
Interpretability (of Individual Trees): While a large boosted ensemble with hundreds or thousands of trees is complex, a single, shallow decision tree is relatively easy to visualize and understand. This property is helpful during the development and debugging phases, allowing inspection of what each individual tree contributes to the ensemble.
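For example, with scikit-learn's GradientBoostingRegressor (synthetic data, illustrative settings), each fitted tree in the ensemble can be printed and read directly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import export_text

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=5, max_depth=2, random_state=0).fit(X, y)

# gbm.estimators_ holds the individual fitted regression trees; printing the
# first one shows exactly which splits it contributes to the ensemble.
first_tree = gbm.estimators_[0, 0]
print(export_text(first_tree, feature_names=[f"f{i}" for i in range(4)]))
```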
Boosting iteratively combines multiple "weak learners" to produce a single "strong learner". A weak learner is defined as a model that performs only slightly better than random guessing on the classification or regression task. In the context of decision trees, this translates to using shallow trees, often referred to as "stumps" (trees with only one split, i.e., depth 1) or trees with very limited depth (e.g., depth 2 to 8).
Why weak learners?
Because boosting builds the model sequentially, each tree only needs to make a small correction to the current ensemble's predictions. Keeping individual trees shallow limits their variance and their tendency to overfit the pseudo-residuals of any single iteration, while the ensemble accumulates expressive power by adding many of these small corrections rather than relying on any single, complex tree.
Common hyperparameters used to control tree complexity and keep them "weak" include:
- `max_depth`: The maximum depth allowed for any individual tree.
- `min_samples_split`: The minimum number of samples required to consider splitting an internal node.
- `min_samples_leaf`: The minimum number of samples required to be present in a leaf node.
- `max_leaf_nodes`: The maximum number of terminal nodes (leaves) a tree can have.

This contrasts sharply with algorithms like Random Forests, which typically benefit from using deep, fully grown (or nearly fully grown) decision trees as base learners, relying on averaging predictions across many uncorrelated trees to reduce variance.
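The contrast can be seen directly in how the two ensembles are typically configured. The sketch below uses scikit-learn with illustrative hyperparameter values, not recommended defaults:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Boosting: many shallow ("weak") trees, each adding a small correction.
gbm = GradientBoostingRegressor(
    n_estimators=300, max_depth=3, min_samples_leaf=5, learning_rate=0.05
)

# Random forest: deep, largely unconstrained trees; variance is reduced
# by averaging predictions across the ensemble instead.
rf = RandomForestRegressor(n_estimators=300, max_depth=None)

print("Boosting R^2:     ", cross_val_score(gbm, X, y, cv=5).mean())
print("Random forest R^2:", cross_val_score(rf, X, y, cv=5).mean())
```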
While standard decision trees are trained to predict the target variable $y$ directly, trees within a gradient boosting ensemble are trained differently. At each iteration $m$, a new tree $h_m(x)$ is trained to predict the pseudo-residuals from the previous stage.
For common loss functions like squared error in regression, the pseudo-residuals are simply the negative gradients of the loss function with respect to the model's prediction at the previous step, which conveniently turn out to be the actual residuals: $r_{im} = y_i - F_{m-1}(x_i)$, where $F_{m-1}(x)$ is the ensemble's prediction after $m-1$ iterations.
For other loss functions (like Log Loss for classification), the pseudo-residuals represent the negative gradient, guiding the tree construction towards directions that maximally reduce the overall loss. The tree splitting criteria (like variance reduction for regression trees) are then applied using these pseudo-residuals as the target values.
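As a concrete case (a standard derivation, not specific to any particular library), take binary classification with log loss, writing $F$ for the raw log-odds prediction and $p = \sigma(F)$ for the predicted probability:

$$L(y_i, F) = -\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr], \qquad p_i = \sigma(F(x_i)) = \frac{1}{1 + e^{-F(x_i)}}$$

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - p_{i,\,m-1}$$

where $p_{i,\,m-1} = \sigma(F_{m-1}(x_i))$. A tree fitted to these pseudo-residuals therefore pushes the log-odds up where the model currently under-predicts the positive class and down where it over-predicts it.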
Consider a simple regression example:
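A minimal hand-rolled sketch of this loop (synthetic data, scikit-learn trees, and an illustrative learning rate) makes the idea concrete: start from a constant prediction, then repeatedly fit a shallow tree to the current residuals and add a scaled version of it to the ensemble.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, 300)

learning_rate = 0.1
F = np.full_like(y, y.mean())              # F_0: constant initial prediction
trees = []

for m in range(100):
    residuals = y - F                      # pseudo-residuals for squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += learning_rate * tree.predict(X)   # add a scaled correction
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y - F) ** 2))
```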
Each tree focuses on explaining the remaining error unexplained by the current ensemble.
(Figure: a simplified view of a single regression tree grown on pseudo-residuals within a boosting iteration. Splits aim to minimize the variance of the residuals within the resulting nodes.)
Individual decision trees suffer from certain drawbacks: a single deep tree has high variance and overfits easily, small changes in the training data can produce a very different tree, and a single shallow tree is too crude (high bias) to model complex relationships accurately.
Gradient boosting effectively mitigates these issues: keeping each tree shallow restrains the variance of any individual learner, while adding many trees sequentially, each fit to the remaining errors and scaled by a learning rate, steadily reduces bias and yields a far more accurate and stable model than any single tree.
In summary, decision trees, particularly shallow CART implementations, provide the necessary flexibility to model complex functions and interactions, can handle different data types, and can be trained reasonably efficiently. Their tendency towards high variance when grown deep is purposefully constrained in boosting by keeping them "weak". They are trained not on the original target but on the residuals or gradients of the loss function, allowing the boosting ensemble to sequentially correct errors and build a powerful predictive model. This foundation is critical for appreciating the specific optimizations and enhancements introduced by algorithms like XGBoost, LightGBM, and CatBoost.