While the exact greedy algorithm described previously guarantees finding the optimal split among all possible values for continuous features (given the current tree structure), its computational demands can become prohibitive for large datasets. Evaluating every unique value as a potential split point requires sorting the feature values at each node, an operation that scales poorly, typically with $O(n \log n)$ complexity where $n$ is the number of instances in the node. Furthermore, repeatedly accessing and sorting potentially large amounts of data can lead to cache misses and memory bottlenecks.
To address these scalability challenges, XGBoost introduces an approximate greedy algorithm. This approach significantly speeds up the split finding process, especially for large datasets, by reducing the number of candidate split points evaluated.
The Histogram-Based Strategy
The core idea behind the approximate algorithm is to discretize continuous features into a limited number of bins before training or at different stages during training. Instead of considering every unique feature value as a potential split point, the algorithm only considers split points between these discrete bins.
Here's how it works:
- Feature Binning (Quantization): For each continuous feature, the algorithm groups the sorted values into a fixed number of discrete bins. This is typically done using quantiles (percentiles) of the feature's distribution. For example, if we choose 256 bins, the algorithm finds approximately 255 split points that divide the data into 256 roughly equal-sized groups based on that feature's values.
- Aggregated Statistics: For each bin, XGBoost calculates and stores the sum of gradients (G) and Hessians (H) for all data points falling into that bin. Let $I_k$ be the set of indices of data points whose feature value falls into bin $k$. Then, the aggregated statistics for bin $k$ are:

$$G_k = \sum_{i \in I_k} g_i, \qquad H_k = \sum_{i \in I_k} h_i$$
- Evaluating Candidate Splits: The algorithm then evaluates potential splits only at the boundaries between bins. To calculate the gain of a split proposed between bin $j$ and bin $j+1$, it efficiently reuses the pre-computed sums. The total gradient and Hessian for the left child ($L$, containing bins $1 \dots j$) and right child ($R$, containing bins $j+1 \dots N_{\text{bins}}$) are simply:

$$G_L = \sum_{k=1}^{j} G_k, \qquad H_L = \sum_{k=1}^{j} H_k$$

$$G_R = \sum_{k=j+1}^{N_{\text{bins}}} G_k, \qquad H_R = \sum_{k=j+1}^{N_{\text{bins}}} H_k$$

These aggregate values are then plugged into the standard gain formula:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
The algorithm scans all bin boundaries and selects the one that maximizes this gain (see the code sketch below).
A continuous feature's values are mapped into discrete bins. Each bin aggregates the gradient (G) and Hessian (H) statistics of the instances it contains. Potential splits are evaluated only at the boundaries between these bins.
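To make the three steps above concrete, here is a minimal NumPy sketch of histogram-based split finding for a single continuous feature. The function name `find_best_split_histogram` and its arguments are illustrative, not part of XGBoost's API; it assumes the gradients `g` and Hessians `h` of the current node's instances have already been computed (for squared-error loss, $g_i = \hat{y}_i - y_i$ and $h_i = 1$).

```python
import numpy as np

def find_best_split_histogram(x, g, h, n_bins=256, lam=1.0, gamma=0.0):
    """Return (best_gain, best_threshold) for one continuous feature x."""
    # 1. Feature binning: bin edges at (approximate) quantiles of x.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]        # n_bins - 1 interior cut points
    edges = np.unique(np.quantile(x, quantiles))           # drop duplicate edges
    if edges.size == 0:                                    # constant feature: nothing to split on
        return float("-inf"), None
    bin_ids = np.searchsorted(edges, x, side="right")      # bin index of each instance

    # 2. Aggregated statistics: sum of gradients / Hessians per bin.
    n = edges.size + 1
    G = np.bincount(bin_ids, weights=g, minlength=n)
    H = np.bincount(bin_ids, weights=h, minlength=n)

    # 3. Evaluate candidate splits only at bin boundaries via prefix sums.
    G_total, H_total = G.sum(), H.sum()
    GL, HL = np.cumsum(G)[:-1], np.cumsum(H)[:-1]          # left child: bins 0..j
    GR, HR = G_total - GL, H_total - HL                    # right child: remaining bins
    gains = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                   - G_total**2 / (H_total + lam)) - gamma

    best = int(np.argmax(gains))
    return gains[best], edges[best]                        # left child gets x < edges[best]

# Example usage with squared-error gradients on synthetic data.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = (x > 0.3).astype(float)
pred = np.full_like(y, y.mean())                           # constant initial prediction
gain, threshold = find_best_split_histogram(x, g=pred - y, h=np.ones_like(y))
```

Note that the expensive per-instance work (binning and accumulating `G`/`H`) is done once, after which every candidate boundary is scored with a handful of arithmetic operations on the bin sums.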
Proposal Methods: Global vs. Local Variants
XGBoost offers two primary variants for when these candidate split points (bin boundaries) are determined:
- Global Variant: The candidate split points are proposed once at the beginning of the entire tree construction process (or even before training starts). The same set of proposals (bin boundaries) is used for evaluating splits at all nodes throughout the growth of a specific tree. This is computationally cheaper as the binning process happens only once per feature.
- Local Variant: The candidate split points are re-proposed after each split. That is, for a given node, the algorithm considers only the data points present in that node, re-calculates the feature quantiles based on this subset, determines new bin boundaries, and proposes splits based on these locally relevant boundaries. This is more computationally intensive as binning happens repeatedly, but it can potentially lead to more accurate splits, especially deeper in the tree where the data distribution might differ significantly from the global distribution.
The choice between global and local variants involves a trade-off:
- Global: Faster training, less memory usage (proposals stored once). May be less accurate if data distributions change significantly within different branches of the tree.
- Local: More computationally expensive, potentially higher memory usage. Adapts better to local data distributions, potentially leading to better accuracy, especially for deeper trees.
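The only difference between the two variants is when, and on which rows, the quantile proposals are computed. A minimal sketch under that reading (illustrative names, not XGBoost's internal API):

```python
import numpy as np

def propose_candidates(x, n_bins=256):
    """Quantile-based candidate split points (bin boundaries) for feature values x."""
    q = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.unique(np.quantile(x, q))

rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)                 # one skewed continuous feature

# Global variant: propose once from all rows, reuse at every node of the tree.
global_candidates = propose_candidates(x)

# Local variant: after a split, re-propose from only the rows that reached the node,
# so the boundaries adapt to that node's (possibly very different) local distribution.
node_mask = x > np.median(x)                   # pretend this node received the upper half
local_candidates = propose_candidates(x[node_mask])
```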
In practice, the global variant is often sufficient and provides significant speedups. The number of bins acts as a regularization parameter itself: fewer bins lead to coarser splits and potentially faster, more regularized models, while more bins allow finer splits, approaching the exact greedy method at the cost of increased computation. The parameter controlling this granularity in XGBoost is often related to `sketch_eps` (epsilon), where roughly `1 / sketch_eps` gives the number of bins.
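As a hedged usage sketch with the Python package (exact parameter names depend on the installed XGBoost version): recent releases expose the bin count directly as `max_bin` for the histogram-based tree methods, whereas older releases controlled the granularity of `tree_method="approx"` through `sketch_eps`.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=10_000)

params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",   # histogram-based approximate split finding
    "max_bin": 256,          # number of bins; fewer = coarser, faster, more regularized
    "max_depth": 6,
    "eta": 0.1,
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=50)
```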
By using histograms and evaluating splits only between bins, the approximate greedy algorithm changes the split finding complexity from being dependent on the number of unique values to being dependent on the number of bins, which is typically much smaller. This makes XGBoost significantly faster and more memory-efficient, enabling it to handle datasets that are too large for the exact greedy algorithm.