As highlighted in the chapter introduction, training gradient boosting models on large datasets can be computationally intensive. A significant portion of this cost comes from iterating through all data instances to evaluate potential splits when building each tree. Gradient-based One-Side Sampling (GOSS) is a core innovation within LightGBM designed to accelerate this process by intelligently reducing the number of data instances used for finding the best splits.
The fundamental insight behind GOSS is that not all training instances contribute equally to the learning process at each boosting iteration. Recall that gradient boosting works by sequentially fitting weak learners (typically decision trees) to the negative gradients of the loss function with respect to the current model's predictions. The gradient for an instance essentially represents the residual error, or how "wrong" the current ensemble is for that specific instance.
Instances with large gradients are those for which the model is currently making significant errors. These instances are, in a sense, "under-trained" and provide substantial information about how to improve the model in the next iteration. Conversely, instances with small gradients are already well-predicted by the current ensemble; further refinement based heavily on these instances might lead to overfitting on minor variations and yield diminishing returns in terms of overall model improvement, while still incurring computational cost.
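To make the link between gradients and errors concrete, consider squared-error loss, where the gradient with respect to a prediction is simply the signed residual. The short sketch below is illustrative only; the arrays and values are made up for the example:

```python
import numpy as np

# For squared-error loss L = 0.5 * (y - y_hat)^2, the gradient with respect to
# the prediction is (y_hat - y): the negative residual. Values here are made up.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.8, 0.1, 1.0, 2.0])   # current ensemble's predictions

gradients = y_pred - y_true
print(gradients)   # [-0.2  0.6 -1.  -5. ]
# The last instance (|g| = 5.0) is badly predicted ("under-trained"), while the
# first (|g| = 0.2) is already well fit and offers little new information.
```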
GOSS leverages this observation by proposing a non-uniform sampling strategy. Instead of treating all instances equally or using simple random sampling, GOSS prioritizes instances with larger gradients while maintaining a representative sample of instances with smaller gradients. This achieves computational savings without drastically altering the data distribution required for effective learning.
The GOSS procedure can be broken down into the following steps during the training of each tree (a code sketch follows below):

1. Compute the gradients for all training instances and sort them by absolute value in descending order.
2. Keep the top a × 100% of instances with the largest gradients (set A). This fraction is controlled by 'top_rate' in LightGBM.
3. From the remaining (1 − a) × 100% of instances, randomly sample b × 100% of the full dataset (set B). This fraction is controlled by 'other_rate' in LightGBM.
4. When evaluating candidate splits, amplify the gradients (and hessians) of the instances in set B by the constant factor (1 − a)/b.

The information gain calculation for finding the best split in a decision tree relies heavily on the sums of gradients and hessians (or just gradients, depending on the specific objective) within the potential child nodes. By retaining only b × N of the (1 − a) × N low-gradient instances, set B undercounts that part of the data. To compensate, and to ensure the expected sum of gradients over set B approximates the sum over the original low-gradient population it represents, the sampled gradients are scaled upwards. The factor (1 − a)/b achieves precisely this: it scales the sampled gradients g_i for i ∈ B so that their sum statistically represents the sum over the entire pool of (1 − a) × N low-gradient instances from which they were drawn.
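The selection and amplification logic can be sketched in plain NumPy. This is a simplified illustration of the idea rather than LightGBM's actual implementation; the function name 'goss_sample' and its inputs are assumptions made for the example:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Simplified GOSS selection: keep the largest-gradient instances,
    subsample the rest, and return indices plus amplification weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    n_top = int(top_rate * n)            # size of set A
    n_other = int(other_rate * n)        # size of set B (fraction of the full dataset)

    # Steps 1-2: sort by |gradient| (descending) and keep the top n_top (set A)
    order = np.argsort(np.abs(gradients))[::-1]
    top_idx = order[:n_top]

    # Step 3: uniformly sample n_other instances from the low-gradient remainder (set B)
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Step 4: amplify set B by (1 - a) / b so its weighted gradient sum
    # approximates the sum over all (1 - a) * n low-gradient instances
    weights = np.concatenate([
        np.ones(n_top),
        np.full(n_other, (1.0 - top_rate) / other_rate),
    ])
    used_idx = np.concatenate([top_idx, sampled_idx])
    return used_idx, weights

# Quick check: the weighted gradient sum over the sample tracks the full-data sum
grads = np.random.default_rng(42).normal(size=10_000)
idx, w = goss_sample(grads)
print(grads.sum(), (grads[idx] * w).sum())
```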
Imagine the distribution of gradient magnitudes across your dataset. GOSS keeps every instance above a certain threshold (the top 'a' fraction) and randomly samples from the instances below that threshold, scaling their influence back up when computing gains.
Illustration of GOSS selection. Instances with high gradients (red squares) are always kept. Instances with low gradients (gray bars) are randomly sampled (blue circles), and their gradients are amplified during gain calculation. Instances represented by gray bars without blue circles are discarded for this tree's split finding.
It's important to distinguish GOSS from the subsampling techniques used in traditional Stochastic Gradient Boosting (as implemented in Scikit-learn or standard XGBoost). Stochastic Gradient Boosting typically performs uniform random sampling of instances (row subsampling) before building each tree. GOSS, in contrast, performs biased sampling based on gradient magnitudes. It assumes that instances with smaller gradients are less critical for defining the optimal split structure and can be sampled more sparsely, provided their influence is correctly scaled.
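For contrast, the uniform row subsampling of Stochastic Gradient Boosting can be requested in scikit-learn through the 'subsample' parameter, which draws each tree's rows at random with equal probability and no gradient-based weighting. A brief illustrative snippet (synthetic data, arbitrary settings):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=5_000, n_features=20, random_state=0)

# Stochastic Gradient Boosting: each tree is fit on a uniform random 50% of the
# rows, regardless of how well or poorly those rows are currently predicted.
sgb = GradientBoostingRegressor(subsample=0.5, random_state=0)
sgb.fit(X, y)
```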
When using LightGBM, GOSS is not used by default; it is enabled by setting the 'boosting_type' (or 'boosting') parameter to 'goss'. The specific behavior is controlled by:

- 'top_rate' (a): The fraction of instances with the largest gradients to keep. The default is 0.2. Increasing it retains more high-gradient data but reduces the speed gain.
- 'other_rate' (b): The fraction of the dataset sampled from the remaining low-gradient instances. The default is 0.1. Increasing it uses more low-gradient data, potentially improving accuracy slightly at the cost of speed.

These parameters usually don't require extensive tuning, but they can be adjusted if you suspect GOSS is negatively impacting accuracy on a specific problem or if you need to maximize training speed further.
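As a usage sketch, the snippet below enables GOSS through the native LightGBM training API on synthetic data. The parameter values are arbitrary and chosen for illustration; note that recent LightGBM releases also expose GOSS via a 'data_sample_strategy' option, so check the documentation of your installed version:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100_000, n_features=50, random_state=0)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",
    "boosting": "goss",    # enable Gradient-based One-Side Sampling
    "top_rate": 0.2,       # a: keep the 20% of instances with the largest gradients
    "other_rate": 0.1,     # b: sample 10% of the data from the low-gradient remainder
    "learning_rate": 0.05,
    "verbosity": -1,
}

booster = lgb.train(params, train_set, num_boost_round=200)
```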
In summary, Gradient-based One-Side Sampling is a clever technique that contributes significantly to LightGBM's efficiency: it reduces the computational burden of split finding by concentrating effort on the data points that matter most for improving the model at each step.