Understanding how supervised learning updates a model's internal structure requires looking at the step-by-step training loop. Supervised Fine-Tuning takes a pre-trained model that already understands grammar and general facts and applies a weight update rule to it using a curated dataset of instruction-response pairs.
When you start the fine-tuning process, the model learns through an iterative cycle consisting of three primary phases: the forward pass, the loss calculation, and the backward pass.
During training, the model receives an input sequence and attempts to predict the next tokens. This is the forward pass. The raw text is converted into numerical tokens. These tokens pass through the model's transformer blocks, which apply self-attention mechanisms and feed-forward neural networks. The output is a probability distribution over the entire vocabulary, often referred to as logits.
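The final step of the forward pass, converting raw logits into a probability distribution, can be sketched in a few lines. This is a minimal illustration assuming a toy four-token vocabulary; the `softmax` function name and the logit values are illustrative, not from the original text:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability;
    # this does not change the resulting distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 4-token vocabulary (illustrative values)
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print(probs)  # four probabilities that sum to 1
```

In a real model the vocabulary has tens of thousands of entries, but the operation is the same: the largest logit receives the largest probability.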
Since this is supervised learning, the training script has access to the exact correct answer. The model's prediction is compared against the actual target tokens from your custom dataset. This comparison is quantified using a mathematical function known as the loss function. For language models, this is typically Cross-Entropy Loss.
For a single predicted token, Cross-Entropy Loss is defined as:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

In this equation, $C$ represents the number of classes, which is the vocabulary size of the model. The variable $y_i$ is the true probability distribution, which is a 1 for the correct token and 0 for all others. The variable $\hat{y}_i$ is the predicted probability for token $i$. A lower calculated loss indicates that the model's predictions closely align with your target data.
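Because the target distribution is one-hot, the sum collapses to the negative log of the probability assigned to the correct token. A minimal sketch, with an illustrative predicted distribution (the `cross_entropy` helper is named here for clarity, not taken from the original text):

```python
import math

def cross_entropy(probs, target_index):
    # With a one-hot target, the full sum over the vocabulary reduces
    # to -log of the probability given to the correct token
    return -math.log(probs[target_index])

probs = [0.7, 0.2, 0.05, 0.05]  # model's predicted distribution (illustrative)
print(cross_entropy(probs, 0))  # low loss: the correct token got high probability
print(cross_entropy(probs, 3))  # high loss: the correct token got low probability
```

Note that a perfect prediction (probability 1.0 on the correct token) yields a loss of exactly zero.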
Once the loss is calculated, the model performs the backward pass using backpropagation. This step calculates the gradient of the loss with respect to every single weight in the model. We represented this gradient as $\nabla L$ in the earlier weight update equation. The gradient indicates the direction and magnitude of change required for each parameter to reduce the overall loss.
This calculation relies on the chain rule from calculus. The error signal propagates from the final output layer all the way back to the initial input embeddings, and during this phase the system computes how much each specific weight contributed to the final error.
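The chain rule in action can be shown on the smallest possible model: a single weight feeding a squared-error loss. The analytic gradient below multiplies the local "error" term by the input, which is exactly the pattern backpropagation repeats layer by layer. The function names and values are illustrative:

```python
# Loss for a one-weight model: L(w) = (w*x - y)^2
def loss(w, x, y):
    return (w * x - y) ** 2

# Chain rule: dL/dw = 2*(w*x - y) * x
# (derivative of the outer square times derivative of the inner term)
def grad(w, x, y):
    return 2 * (w * x - y) * x

w, x, y = 0.5, 2.0, 3.0

# Finite-difference check that the analytic gradient is correct
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(grad(w, x, y), numeric)  # the two values should agree closely
```

Autograd systems automate precisely this bookkeeping across millions of weights, which is why they must keep the forward-pass activations around until the backward pass completes.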
The cyclical process of supervised fine-tuning, from the forward pass generating predictions to the optimizer updating the model weights.
With the gradients computed, the optimizer is responsible for actually updating the model's weights. While standard gradient descent uses the basic formula $w_{new} = w_{old} - \eta \nabla L$, modern fine-tuning typically uses more advanced optimizers like AdamW.
The optimizer applies the learning rate, represented by $\eta$, which controls how large of a step the model takes when updating weights. If the learning rate is too high, the model might overshoot the optimal weights and fail to converge. If it is too low, the training process will be unnecessarily slow. AdamW improves upon the basic formula by adjusting the learning rate dynamically for each parameter based on historical gradients, while also adding weight decay to prevent the model from memorizing the training data.
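The AdamW update for a single parameter can be sketched directly from its published form. This is a simplified single-scalar version for illustration (the `adamw_step` name and the hyperparameter defaults are assumptions, though the defaults shown are the commonly used ones):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Exponential moving averages of the gradient (m) and its square (v):
    # these are the "historical gradients" that adapt the step per parameter
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction compensates for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay (the "W" in AdamW)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
print(w)  # slightly below 1.0: a step against the gradient, plus decay
```

Note that each parameter carries two extra running values, $m$ and $v$; this is the optimizer state that inflates training memory, discussed next.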
Because the backward pass requires storing intermediate activations from the forward pass, the memory overhead during supervised fine-tuning is significantly higher than during standard text generation. For every parameter in the model, the system must store the weight itself, the computed gradient, and the optimizer states.
This is exactly why training a small language model requires careful resource management. A model that takes up 4 GB of RAM to generate text might require 16 GB or more to train using standard supervised fine-tuning. By modifying only a small subset of these weights, we can drastically reduce the memory footprint.
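A back-of-the-envelope estimate makes the overhead concrete. Assuming 32-bit values and AdamW's two optimizer states per parameter (the `sft_memory_gb` helper and the 1-billion-parameter figure are illustrative, not from the original text):

```python
def sft_memory_gb(num_params, bytes_per_value=4):
    # Per parameter: the weight itself, its gradient, and two AdamW
    # optimizer states (first and second moments) -- four values total
    values_per_param = 4
    return num_params * values_per_param * bytes_per_value / 1e9

# A hypothetical 1-billion-parameter model in 32-bit precision
print(sft_memory_gb(1_000_000_000))  # ~16 GB for weights plus training state
```

This estimate excludes the stored forward-pass activations, which grow with batch size and sequence length, so real training runs need even more headroom.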