Gradient Boosting trains each new model on the negative gradient of a loss function. Here we assemble those pieces into the complete, step-by-step algorithm; laying the process out sequentially clarifies how each new tree contributes to improving the overall ensemble.

The algorithm iteratively refines its predictions. It starts with a simple initial guess and then, for a specified number of rounds, builds a new tree designed to correct the errors made by the current model.

```dot
digraph G {
    rankdir=TB;
    node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="Helvetica"];
    edge [fontname="Helvetica"];
    subgraph cluster_0 {
        style=invis;
        init [label="1. Initialize Model\n(e.g., with mean of target)", fillcolor="#a5d8ff"];
        loop_start [label="For each tree m = 1 to M:", shape=diamond, style=filled, fillcolor="#ffec99"];
        compute_grad [label="2a. Compute Pseudo-Residuals\n(Negative Gradient)", fillcolor="#d0bfff"];
        fit_tree [label="2b. Fit a Weak Learner\n(Decision Tree on Residuals)", fillcolor="#b2f2bb"];
        update_model [label="2c. Update Ensemble Prediction\n(Add new tree's contribution)", fillcolor="#ffd8a8"];
        final_model [label="3. Final Model", shape=ellipse, style=filled, fillcolor="#96f2d7"];
        init -> loop_start;
        loop_start -> compute_grad [label=" Start Iteration"];
        compute_grad -> fit_tree;
        fit_tree -> update_model;
        update_model -> loop_start [label=" Next Iteration"];
        loop_start -> final_model [label=" M trees built"];
    }
}
```

The iterative process of the Gradient Boosting Machine algorithm. Each cycle adds a new tree trained to correct the errors of the current ensemble.

Let's break down each step in detail. For this explanation, we will focus on a regression problem using Mean Squared Error (MSE) as the loss function, where the process is most intuitive.

Step 1: Initialize the Model with a Constant Value

Before we can start correcting errors, we need an initial prediction. What is the best single prediction we can make for all samples in the absence of any features? For MSE, the constant value that minimizes the overall error is the mean of the target variable, $y$.

So, our initial model, $F_0(x)$, is simply this average:

$$ F_0(x) = \bar{y} $$

This single value serves as our starting point, or "zeroth" prediction, for every observation in the training set.

Step 2: Iterate and Build Trees

Now we enter a loop that runs for a predefined number of iterations, $M$, where each iteration adds one new tree to the model. Let's look at what happens inside the loop for iteration $m$ (from $1$ to $M$).

2a. Compute the Pseudo-Residuals

The core of Gradient Boosting is training new models on the error of the previous ones. As we established, this "error" is formally the negative gradient of the loss function. We call these values pseudo-residuals.

For each sample $i$, the pseudo-residual $r_{im}$ is calculated based on the previous model's prediction, $F_{m-1}(x_i)$:

$$ r_{im} = - \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)} $$

For an MSE loss function, where $L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$, this derivative simplifies beautifully. The negative gradient becomes:

$$ r_{im} = y_i - F_{m-1}(x_i) $$

This is simply the actual value minus the predicted value, which is the standard residual error. For the first iteration ($m=1$), the pseudo-residuals are just the target values minus the overall mean: $r_{i1} = y_i - \bar{y}$.
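To make steps 1 and 2a concrete, here is a minimal sketch in Python (using NumPy) of the initialization and the MSE pseudo-residual computation; the toy array `y` is purely illustrative and not from the original text.

```python
import numpy as np

# Illustrative target values; no features are needed yet, since the
# initial prediction is a single constant shared by all samples.
y = np.array([3.0, 5.0, 7.0, 9.0])

# Step 1: initialize with the constant that minimizes MSE,
# i.e. the mean of the target variable.
F0 = y.mean()                        # F_0(x) = y-bar for every sample

# Step 2a (first iteration, m = 1): for MSE, the pseudo-residuals are
# just the ordinary residuals of the current model.
current_prediction = np.full_like(y, F0)
residuals = y - current_prediction   # r_i1 = y_i - y-bar
print(F0, residuals)                 # 6.0 [-3. -1.  1.  3.]
```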
2b. Fit a Base Learner (Decision Tree) to the Pseudo-Residuals

Next, we fit a new weak learner, which we'll call $h_m(x)$, to predict the pseudo-residuals we just calculated. This is the key point: the new tree is not trained to predict the original target $y$, but rather the current residual error.

Input features: the original features $X$ from our dataset.

Target: the pseudo-residuals $\{r_{1m}, r_{2m}, \dots, r_{Nm}\}$ from step 2a.

The resulting tree, $h_m(x)$, learns the relationship between the features and the residual error of the current model. For instance, if the model is consistently under-predicting for a certain group of samples, the tree will learn to output a positive value for them.

2c. Update the Ensemble Model

We now update the overall model by adding the new tree's predictions to the previous model. However, we don't add the full prediction; we scale it by a small factor called the learning rate (often denoted by $\eta$, eta, or alpha).

The update rule is:

$$ F_m(x) = F_{m-1}(x) + \eta \, h_m(x) $$

The learning rate is a regularization technique. A smaller learning rate shrinks the contribution of each individual tree, requiring more trees in the ensemble but often leading to better generalization. It prevents the model from changing too drastically with the addition of a single tree, making the learning process more stable. Typical values for $\eta$ are between 0.01 and 0.3.

Step 3: Output the Final Model

After the loop completes (i.e., we have built all $M$ trees), the final model is the sum of the initial prediction and the scaled contributions from all the trees.

The final prediction for a new observation is given by:

$$ \hat{y} = F_M(x) = F_0(x) + \sum_{m=1}^{M} \eta \, h_m(x) $$

This final model is a sophisticated function built up iteratively, with each component specializing in correcting the leftover errors of the models that came before it. By taking small, careful steps in the direction of the negative gradient, the model gradually reduces the overall loss.
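The following is a minimal end-to-end sketch of the loop described above, using scikit-learn's `DecisionTreeRegressor` as the weak learner. The function name `gradient_boost_mse` and the hyperparameter values ($M$, learning rate, tree depth) are illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_mse(X, y, M=100, learning_rate=0.1, max_depth=2):
    """Minimal gradient boosting for regression with MSE loss."""
    # Step 1: initialize with the mean of the target.
    F0 = y.mean()
    prediction = np.full(len(y), F0)
    trees = []

    for m in range(M):
        # Step 2a: pseudo-residuals (negative gradient of MSE).
        residuals = y - prediction
        # Step 2b: fit a shallow tree to the residuals, not to y itself.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 2c: update the ensemble, scaled by the learning rate.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)

    # Step 3: the final model is F_0 plus the scaled sum of all trees.
    def predict(X_new):
        return F0 + learning_rate * sum(t.predict(X_new) for t in trees)

    return predict

# Example usage on a toy dataset:
# rng = np.random.default_rng(0)
# X = rng.uniform(-3, 3, size=(200, 1))
# y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
# model = gradient_boost_mse(X, y)
# print(model(X[:5]))
```

Because each tree only nudges the prediction by `learning_rate` times its output, lowering the learning rate generally calls for a larger $M$; the two hyperparameters are usually tuned together.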