Building upon the regularized learning objective introduced earlier, XGBoost needs an effective method to construct the decision trees that form the ensemble. At the heart of this process lies the split finding algorithm. For each potential node in a tree being built, the algorithm must determine the best feature and the best split point for that feature to partition the data. The "best" split is defined as the one that maximizes the reduction in the objective function value, often referred to as the gain.
XGBoost employs several strategies for finding splits, and the most fundamental one is the Exact Greedy Algorithm. It's termed "exact" because it meticulously evaluates every possible split point for every feature among the instances currently residing in that node. It's "greedy" because it makes the locally optimal choice at each node, selecting the split with the highest gain without considering the impact on subsequent splits further down the tree.
Recall the XGBoost objective function at step $t$, incorporating the complexity penalty:

$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss function with respect to the prediction for instance $i$ at step $t-1$, $f_t(x_i)$ is the prediction of the new tree for instance $i$, and $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$ is the regularization term, with $T$ the number of leaves and $w_j$ the weight of leaf $j$.
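To make these quantities concrete, here is a minimal sketch of what $g_i$ and $h_i$ look like for squared error loss $L = \frac{1}{2}(y_i - \hat{y}_i)^2$, where the gradient is the residual and the hessian is constant; the array values and names are illustrative, not taken from any real dataset.

```python
import numpy as np

# Hypothetical labels and current ensemble predictions (from step t-1)
y = np.array([1.0, 0.5, 2.0, 1.5])
y_pred = np.array([0.8, 0.7, 1.6, 1.4])

# For squared error loss L = 1/2 * (y - y_pred)^2:
grad = y_pred - y          # first-order gradient g_i (the residual, with sign flipped)
hess = np.ones_like(y)     # second-order gradient h_i (constant 1 for squared error)
```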
For a fixed tree structure, the optimal weight $w_j^*$ for a leaf $j$ containing the set of instances $I_j$ is:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
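As a small illustrative sketch (the function name and example values are assumptions, not XGBoost internals), the optimal leaf weight is just the negated ratio of summed gradients to summed hessians plus $\lambda$:

```python
import numpy as np

def optimal_leaf_weight(grad, hess, reg_lambda=1.0):
    """w_j* for a leaf, given the gradients/hessians of the instances it contains."""
    return -grad.sum() / (hess.sum() + reg_lambda)

# Example: a leaf holding three instances
grad = np.array([-0.2, 0.2, -0.4])
hess = np.array([1.0, 1.0, 1.0])
print(optimal_leaf_weight(grad, hess))  # -(-0.4) / (3.0 + 1.0) = 0.1
```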
And the corresponding optimal objective value contributed by this tree structure is:
$$\text{Obj}^* = -\frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

Let's define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. The objective becomes:

$$\text{Obj}^* = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
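To make the notation concrete, a fixed tree structure can be scored directly from its per-leaf sums. The helper below is an illustrative sketch under these definitions, not XGBoost's internal scorer:

```python
def structure_score(G, H, reg_lambda=1.0, gamma=0.0):
    """Obj* for a fixed tree structure with per-leaf sums G = [G_1..G_T], H = [H_1..H_T]."""
    T = len(G)
    return -0.5 * sum(g**2 / (h + reg_lambda) for g, h in zip(G, H)) + gamma * T

# Two-leaf example: leaf sums (G_1, H_1) = (-3.0, 5.0) and (G_2, H_2) = (4.0, 6.0)
print(structure_score(G=[-3.0, 4.0], H=[5.0, 6.0], gamma=0.5))
```

Lower (more negative) scores are better, so the split finding procedure below looks for the structure change that decreases this value the most.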
Now, consider splitting a leaf (node) containing instances $I$ into two children: a left child with instances $I_L$ and a right child with instances $I_R$, such that $I = I_L \cup I_R$. The gain from this split is the objective value for the parent node (if it remained a leaf) minus the sum of the objective values for the two new child leaves.
The objective value if the node $I$ is not split (remains a single leaf) is:

$$\text{Obj}_{\text{no split}} = -\frac{1}{2}\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$

The objective value if the node is split into $I_L$ and $I_R$ is:

$$\text{Obj}_{\text{split}} = -\frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right) + 2\gamma$$
The gain is defined as $\text{Gain} = \text{Obj}_{\text{no split}} - \text{Obj}_{\text{split}}$. Writing this difference out gives the form usually quoted in the XGBoost literature: the improvement in the first part of the objective, minus the penalty $\gamma$ for the one extra leaf the split introduces (splitting replaces one leaf with two, adding complexity):

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
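As a sketch, the gain for a candidate split can be computed directly from the aggregated statistics. The helper below assumes you already have $G_L, H_L, G_R, H_R$ for the two children; the function name and example numbers are illustrative.

```python
def split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node with totals (G_L + G_R, H_L + H_R) into two children."""
    def score(G, H):
        return G**2 / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# The gain is positive only if the children fit the gradients noticeably
# better than the parent, after paying the gamma penalty for the extra leaf.
print(split_gain(G_L=-3.0, H_L=5.0, G_R=4.0, H_R=6.0, gamma=0.5))
```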
This formula is the core calculation used by the exact greedy algorithm. A split is considered beneficial only if the resulting gain is positive (or greater than a minimum threshold, set by the `min_split_loss` parameter, also known as `gamma` in the API). The $\gamma$ term thus acts as a regularization parameter, controlling complexity by setting a minimum gain required to make a split.
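As a hedged illustration of how this threshold is exposed, the snippet below trains a small model with an explicit `gamma` (alias `min_split_loss`) value and the exact split finder; the synthetic data and the specific parameter values are arbitrary choices for demonstration.

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data (arbitrary, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "tree_method": "exact",   # use the exact greedy split finder
    "gamma": 1.0,             # min_split_loss: minimum gain required to split
    "reg_lambda": 1.0,        # L2 regularization on leaf weights (lambda above)
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=20)
```

Raising `gamma` prunes marginal splits and yields shallower, more conservative trees; lowering it lets the algorithm accept smaller improvements.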
The algorithm proceeds as follows for each node:

1. Initialize the best gain found so far, `max_gain = 0`.
2. For each feature, sort the instances in the node by that feature's value.
3. Scan the sorted instances from left to right, accumulating the running sums $G_L$ and $H_L$; the right-side statistics follow immediately as $G_R = G - G_L$ and $H_R = H - H_L$, where $G$ and $H$ are the totals for the node.
4. At each boundary between distinct feature values, compute the gain with the formula above and update `max_gain` (and the corresponding feature and threshold) if it improves.
5. Once all features have been scanned, apply the best split found, or leave the node as a leaf if no candidate exceeds the required threshold. A sketch of this scan appears after the list.
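The following is a minimal sketch of that scan for a single numeric feature, assuming the per-instance gradients `grad` and hessians `hess` have already been computed; the function name `best_split_single_feature` and its interface are illustrative, not XGBoost's internal API.

```python
import numpy as np

def best_split_single_feature(x, grad, hess, reg_lambda=1.0, gamma=0.0):
    """Exact greedy scan over one feature: returns (best_gain, best_threshold)."""
    order = np.argsort(x)                 # sort instances by feature value
    x, grad, hess = x[order], grad[order], hess[order]

    G, H = grad.sum(), hess.sum()         # totals for the parent node
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None

    for j in range(len(x) - 1):
        G_L += grad[j]
        H_L += hess[j]
        if x[j] == x[j + 1]:              # can only split between distinct values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + reg_lambda)
                      + G_R**2 / (H_R + reg_lambda)
                      - G**2 / (H + reg_lambda)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[j] + x[j + 1]) / 2.0
    return best_gain, best_threshold
```

In the real implementation this scan runs over every feature, and the feature and threshold pair with the highest gain determines how the node is split.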
The following diagram illustrates the process for finding the best split for a single feature:
Iterative process within the exact greedy algorithm for a single feature. Instances are sorted, and each potential split point is evaluated based on the gain calculation. The best split across all features determines the node's fate.
The main computational bottleneck in the exact greedy algorithm is sorting the feature values at each node. If there are $n$ instances reaching a node and $d$ features, each feature must be sorted at a cost of $O(n \log n)$, giving roughly $O(d \, n \log n)$ work per node before the linear scan over candidate splits even begins, and this is repeated at every node of every tree.
While thorough, this enumeration can become computationally demanding, especially when dealing with:

- very large numbers of instances, where the repeated sorting and scanning dominate training time;
- high-dimensional data, since every feature must be enumerated at every node;
- data that does not fit entirely into memory, where reading all feature values for every candidate split is impractical;
- distributed settings, where exact enumeration would require moving or synchronizing entire sorted feature columns.
This computational cost motivates the development of approximate split finding algorithms, which we will discuss in the next section. They trade off some accuracy in finding the absolute best split for significant gains in computational efficiency, making XGBoost practical for very large datasets.
The exact greedy algorithm forms the foundation for tree construction in XGBoost. Its strength lies in its exhaustive search for the locally optimal split based on the regularized objective function. Understanding this process is important for appreciating how XGBoost builds highly effective trees and why alternative methods like approximate algorithms are sometimes necessary.