Understanding how supervised learning updates a model's internal structure requires looking at the step-by-step training loop. Supervised Fine-Tuning takes a pre-trained model that already understands grammar and general facts and applies a weight update rule to it using a curated dataset of instruction-response pairs.
When you start the fine-tuning process, the model learns through an iterative cycle consisting of three primary phases: the forward pass, the loss calculation, and the backward pass.
During training, the model receives an input sequence and attempts to predict the next tokens. This is the forward pass. The raw text is converted into numerical tokens. These tokens pass through the model's transformer blocks, which apply self-attention mechanisms and feed-forward neural networks. The output is a probability distribution over the entire vocabulary, often referred to as logits.
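The final step of the forward pass, converting raw logits into a probability distribution, can be sketched in a few lines. This is a minimal illustration assuming a toy four-token vocabulary; the `softmax` function name and the logit values are illustrative, not from the original text:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability;
    # this does not change the resulting distribution
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 4-token vocabulary (illustrative values)
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
print(probs)  # four probabilities that sum to 1
```

In a real model the vocabulary has tens of thousands of entries, but the operation is the same: the largest logit receives the largest probability.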
Since this is supervised learning, the training script has access to the exact correct answer. The model's prediction is compared against the actual target tokens from your custom dataset. This comparison is quantified using a mathematical function known as the loss function. For language models, this is typically Cross-Entropy Loss.
For a single predicted token, Cross-Entropy Loss is defined as:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

In this equation, $C$ represents the number of classes, which is the vocabulary size of the model. The variable $y_i$ is the true probability distribution, which is a 1 for the correct token and 0 for all others. The variable $\hat{y}_i$ is the predicted probability for token $i$. A lower calculated loss indicates that the model's predictions closely align with your target data.
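Because the target distribution is one-hot, the sum collapses to the negative log of the probability assigned to the correct token. A minimal sketch, with an illustrative predicted distribution (the `cross_entropy` helper is named here for clarity, not taken from the original text):

```python
import math

def cross_entropy(probs, target_index):
    # With a one-hot target, the full sum over the vocabulary reduces
    # to -log of the probability given to the correct token
    return -math.log(probs[target_index])

probs = [0.7, 0.2, 0.05, 0.05]  # model's predicted distribution (illustrative)
print(cross_entropy(probs, 0))  # low loss: the correct token got high probability
print(cross_entropy(probs, 3))  # high loss: the correct token got low probability
```

Note that a perfect prediction (probability 1.0 on the correct token) yields a loss of exactly zero.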
Once the loss is calculated, the model performs the backward pass using backpropagation. This step calculates the gradient of the loss with respect to every single weight in the model. We represented this gradient as $\nabla L$ in the earlier weight update equation. The gradient indicates the direction and magnitude of change required for each parameter to reduce the overall loss.
This calculation relies on the chain rule from calculus. The error signal propagates from the final output layer all the way back to the initial input embeddings, and during this phase the system computes how much each specific weight contributed to the final error.
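The chain rule in action can be shown on the smallest possible model: a single weight feeding a squared-error loss. The analytic gradient below multiplies the local "error" term by the input, which is exactly the pattern backpropagation repeats layer by layer. The function names and values are illustrative:

```python
# Loss for a one-weight model: L(w) = (w*x - y)^2
def loss(w, x, y):
    return (w * x - y) ** 2

# Chain rule: dL/dw = 2*(w*x - y) * x
# (derivative of the outer square times derivative of the inner term)
def grad(w, x, y):
    return 2 * (w * x - y) * x

w, x, y = 0.5, 2.0, 3.0

# Finite-difference check that the analytic gradient is correct
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
print(grad(w, x, y), numeric)  # the two values should agree closely
```

Autograd systems automate precisely this bookkeeping across millions of weights, which is why they must keep the forward-pass activations around until the backward pass completes.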
The cyclical process of supervised fine-tuning, from the forward pass generating predictions to the optimizer updating the model weights.
With the gradients computed, the optimizer is responsible for actually updating the model's weights. While standard gradient descent uses the basic formula $w_{new} = w_{old} - \eta \nabla L$, modern fine-tuning typically uses more advanced optimizers like AdamW.
The optimizer applies the learning rate, represented by $\eta$, which controls how large of a step the model takes when updating weights. If the learning rate is too high, the model might overshoot the optimal weights and fail to converge. If it is too low, the training process will be unnecessarily slow. AdamW improves upon the basic formula by adjusting the learning rate dynamically for each parameter based on historical gradients, while also adding weight decay to prevent the model from memorizing the training data.
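The AdamW update for a single parameter can be sketched directly from its published form. This is a simplified single-scalar version for illustration (the `adamw_step` name and the hyperparameter defaults are assumptions, though the defaults shown are the commonly used ones):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Exponential moving averages of the gradient (m) and its square (v):
    # these are the "historical gradients" that adapt the step per parameter
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction compensates for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay (the "W" in AdamW)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, g=0.5, m=m, v=v, t=1)
print(w)  # slightly below 1.0: a step against the gradient, plus decay
```

Note that each parameter carries two extra running values, $m$ and $v$; this is the optimizer state that inflates training memory, discussed next.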
Because the backward pass requires storing intermediate activations from the forward pass, the memory overhead during supervised fine-tuning is significantly higher than during standard text generation. For every parameter in the model, the system must store the weight itself, the computed gradient, and the optimizer states.
This is exactly why training a small language model requires careful resource management. A model that takes up 4 GB of RAM to generate text might require 16 GB or more to train using standard supervised fine-tuning. By modifying only a small subset of these weights, we can drastically reduce the memory footprint.
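A back-of-the-envelope estimate makes the overhead concrete. Assuming 32-bit values and AdamW's two optimizer states per parameter (the `sft_memory_gb` helper and the 1-billion-parameter figure are illustrative, not from the original text):

```python
def sft_memory_gb(num_params, bytes_per_value=4):
    # Per parameter: the weight itself, its gradient, and two AdamW
    # optimizer states (first and second moments) -- four values total
    values_per_param = 4
    return num_params * values_per_param * bytes_per_value / 1e9

# A hypothetical 1-billion-parameter model in 32-bit precision
print(sft_memory_gb(1_000_000_000))  # ~16 GB for weights plus training state
```

This estimate excludes the stored forward-pass activations, which grow with batch size and sequence length, so real training runs need even more headroom.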