Building upon the introduction, let's examine the foundational principles that make knowledge distillation (KD) an effective technique for model compression, particularly in the context of large language models. The central idea, originating from the work of Hinton, Vinyals, and Dean (2015), goes beyond simply training a smaller model on the same data; it involves transferring the nuanced "knowledge" captured by a proficient, but cumbersome, teacher model to a more compact student model.
At its heart, KD leverages the concept that a trained teacher model's output distribution contains richer information than just the ground-truth labels used during its initial training. Consider a standard classification task. Training typically uses one-hot encoded vectors (hard targets), where the correct class has a probability of 1 and all others have 0. However, the teacher model, after training, produces a probability distribution over all possible outputs. For instance, when predicting the next word in a sequence, the teacher might assign a high probability to the correct word, but also assign small, non-zero probabilities to other plausible (or even semantically related but incorrect) words. This richer distribution, often referred to as "dark knowledge," reveals similarities and relationships between outputs that the teacher has learned.
The standard KD approach captures this dark knowledge by training the student model to mimic the teacher's softened output probabilities. This is achieved using a temperature scaling parameter, $T$, applied within the softmax function. For a model producing logits $z$, the standard softmax probability for class $i$ is $p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$. With temperature scaling, the softened probability $q_i$ becomes:
$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

A higher temperature $T > 1$ softens the probability distribution, increasing its entropy and giving more relative weight to smaller logit values (the "dark knowledge"). As $T \to \infty$, the distribution approaches uniform, while $T = 1$ recovers the standard softmax.
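To make the effect of the temperature concrete, here is a minimal sketch in plain NumPy. The logit values are illustrative, chosen only to show how raising $T$ redistributes probability mass onto lower-scoring classes.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T before normalizing."""
    scaled = logits / T
    scaled -= scaled.max()            # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Illustrative logits for a 4-class problem (values chosen for demonstration only).
logits = np.array([6.0, 2.5, 1.0, -1.0])

for T in (1.0, 2.0, 5.0):
    q = softmax_with_temperature(logits, T)
    print(f"T={T}: {np.round(q, 3)}")
# Higher T spreads probability mass onto the lower-scoring classes,
# exposing the relative similarities the teacher has learned.
```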
The student model is then trained using a loss function that encourages its softened outputs (calculated using the same temperature T) to match the teacher's softened outputs. The Kullback-Leibler (KL) divergence is commonly used for this purpose:
$$\mathcal{L}_{KD} = T^2 \cdot D_{KL}\!\left(q_{\text{teacher}} \,\|\, q_{\text{student}}\right)$$

Here, $q_{\text{student}}$ and $q_{\text{teacher}}$ are the softened probability distributions from the student and teacher models, respectively, with the teacher's distribution serving as the target. The $T^2$ scaling factor ensures that the gradients remain comparable in magnitude as the temperature changes. Note that the teacher model's parameters are frozen during this process; only the student model is being trained.
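A minimal sketch of this loss in PyTorch is shown below. The function name `kd_loss` and its arguments are assumptions for illustration; the teacher's softened distribution is passed to `F.kl_div` as the target, matching the formulation above.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss: T^2 * KL(teacher || student) on softened distributions."""
    # Softened log-probabilities of the student and probabilities of the (frozen) teacher.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # F.kl_div expects log-probabilities as the input and probabilities as the target;
    # "batchmean" averages the divergence over the batch dimension.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return (T ** 2) * kl
```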
Often, this distillation loss is combined with a standard loss function (e.g., cross-entropy) computed between the student's predictions (using T=1) and the true hard labels (y). This helps ensure the student still performs well on the original task objective. The final loss function becomes a weighted average:
$$\mathcal{L}_{\text{total}} = \alpha \, \mathcal{L}_{CE}(p_{\text{student}}, y) + (1 - \alpha) \, \mathcal{L}_{KD}(q_{\text{student}}, q_{\text{teacher}})$$

The hyperparameter $\alpha$ (typically between 0 and 1) balances the contribution of the hard-target loss ($\mathcal{L}_{CE}$) and the soft-target distillation loss ($\mathcal{L}_{KD}$). Choosing appropriate values for $T$ and $\alpha$ is essential for successful distillation and often requires empirical tuning.
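The sketch below combines the two terms, reusing the `kd_loss` function from the previous snippet. Here `labels` is assumed to hold ground-truth class indices, and the default values of `T` and `alpha` are placeholders to be tuned empirically.

```python
def distillation_step(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combined objective: alpha * hard-label CE (at T = 1) + (1 - alpha) * soft-target KD loss."""
    # Standard cross-entropy against the ground-truth labels, computed at T = 1.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target loss against the teacher's softened distribution (kd_loss defined above).
    kd = kd_loss(student_logits, teacher_logits, T=T)
    return alpha * ce + (1 - alpha) * kd
```

In an actual training loop, the teacher's logits would be computed under `torch.no_grad()` so that only the student's parameters receive gradient updates.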
Figure: Basic knowledge distillation with soft targets. The teacher model generates soft targets using temperature scaling, which the student model tries to mimic via the KD loss ($\mathcal{L}_{KD}$). The student also learns from the ground-truth labels via a standard cross-entropy loss ($\mathcal{L}_{CE}$).
The teacher and student models do not need to share an identical architecture, although staying within the same model family is common when distilling LLMs (e.g., distilling a 70B-parameter Llama model into a 7B-parameter Llama model). The primary requirement is that their output layers (e.g., vocabularies) are compatible so the distillation loss can be computed. The student learns to approximate the complex function learned by the teacher, guided by the rich supervisory signal contained in the soft targets. This often leads to better generalization and performance than training the student solely on hard targets from scratch, effectively transferring the inductive biases learned by the large teacher model.
While matching the final output distribution is the classic approach, it's important to recognize that this represents only one facet of knowledge transfer. Subsequent sections will examine more advanced distillation objectives that leverage intermediate representations and attention mechanisms, providing alternative pathways to imbue the student model with the teacher's capabilities.