Building upon the principles outlined earlier in this chapter, we now transition to the practical implementation of distilling knowledge from a large generative language model (the teacher) into a smaller, more efficient counterpart (the student). The objective is to create a student model that retains a significant portion of the teacher's generative capabilities while being substantially smaller and faster. This section provides a step-by-step guide, focusing on the key implementation details and evaluation strategies pertinent to generative models.
Before initiating the distillation process, careful preparation is necessary.
Teacher Model: The teacher is typically a large, capable generative model, for example GPT-3.5, LLaMA-7B, or even a fine-tuned version specialized for a particular style or domain. For this practical guide, let's assume we are distilling knowledge from a conceptual TeacherLM-7B (7 billion parameters). Accessing the teacher model requires loading its weights and architecture, typically using libraries like Hugging Face Transformers.
Student Model: The student is a smaller model, here a conceptual StudentLM-1B (1 billion parameters). Crucially, the student's architecture should be compatible with the teacher's output format (e.g., both producing logits over the same vocabulary). While the number of layers and hidden dimensions will differ, the core generative mechanism (e.g., transformer decoder) should be similar.
Libraries: transformers, datasets, and potentially accelerate for efficient training.
# Conceptual Setup
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
# Load Teacher Model (ensure it's in eval mode and requires no gradients)
teacher_model_name = "path/to/large/teacher/model"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name)
teacher_model.eval()
for param in teacher_model.parameters():
    param.requires_grad = False
# Load or Define Student Model
student_model_name = "path/to/smaller/student/config_or_model" # Or define architecture
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name) # Often same as teacher
student_model = AutoModelForCausalLM.from_pretrained(student_model_name) # Or initialize from config
# Load Dataset
# dataset = load_dataset(...)
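As an illustration of how the distillation corpus might be prepared, the sketch below tokenizes a generic text dataset; the dataset path, the "text" column name, and the sequence length of 512 are placeholders, not prescriptions.

from datasets import load_dataset

# raw_dataset = load_dataset("path/or/name/of/text/corpus")

def tokenize_fn(examples):
    # Truncate to a fixed length; the data collator used later creates the causal-LM labels
    return student_tokenizer(examples["text"], truncation=True, max_length=512)

# tokenized_dataset = raw_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])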
The core of knowledge distillation lies in the loss function that guides the student's training.
As introduced earlier, a common approach combines a standard language modeling loss (if ground truth targets are available) with a knowledge distillation loss that encourages the student to mimic the teacher's output distribution.
Soft Labels (KL Divergence): The primary KD loss minimizes the KL divergence between the teacher's and student's probability distributions over the vocabulary. Temperature scaling is applied to soften both distributions, exposing the relative probabilities the teacher assigns to less likely tokens and thereby providing a richer supervisory signal than the single most likely token alone.
The loss for a single token prediction is:
$$\mathcal{L}_{KD} = T^2 \cdot D_{KL}\big(\sigma(z_T / T)\,\|\,\sigma(z_S / T)\big)$$

where $z_S$ and $z_T$ are the logits produced by the student and teacher models respectively, $T$ is the temperature (typically $T > 1$), and $\sigma$ denotes the softmax function. The $T^2$ factor keeps the gradient magnitude comparable across temperatures. Averaging this loss across the sequence length and batch gives the final KD loss component.
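To make the effect of $T$ concrete, the toy snippet below softens a made-up four-token logit vector at two temperatures:

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 0.5, -1.0])   # illustrative logits, not from a real model
print(F.softmax(logits / 1.0, dim=-1))         # sharp: roughly [0.85, 0.12, 0.03, 0.01]
print(F.softmax(logits / 2.0, dim=-1))         # softer: roughly [0.62, 0.23, 0.11, 0.05]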
Hard Labels (Cross-Entropy): If the distillation dataset includes ground truth next tokens (e.g., during continued pre-training or fine-tuning), the standard cross-entropy loss (LCE) can be used alongside the KD loss. This grounds the student model in the actual task data.
$$\mathcal{L}_{CE} = -\sum_i y_i \log\big(\sigma(z_S)_i\big)$$

where $y_i$ is the one-hot encoded ground truth label for the $i$-th token.
Combined Loss: The final loss function is typically a weighted sum of the cross-entropy loss (if used) and the KL divergence loss:
$$\mathcal{L}_{Total} = (1 - \alpha)\,\mathcal{L}_{CE} + \alpha\,\mathcal{L}_{KD}$$

Here, $\alpha$ is a hyperparameter (between 0 and 1) balancing the influence of the ground truth labels and the teacher's soft labels. Choosing the right $\alpha$ and temperature $T$ often requires experimentation.
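As a minimal, self-contained sketch of this combined objective (assuming the student and teacher logits are already aligned position by position, that labels holds the shifted next-token targets, and ignoring padding for brevity):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-label term: T^2-scaled KL between the softened teacher and student distributions
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    loss_kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)
    # Hard-label term: cross-entropy against the ground-truth next tokens
    loss_ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    # Weighted combination
    return (1.0 - alpha) * loss_ce + alpha * loss_kd

Setting $\alpha = 1$ recovers pure distillation from the teacher's soft labels, while $\alpha = 0$ reduces to ordinary language-model training.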
For deeper knowledge transfer, especially when architectural differences exist, matching intermediate representations can be beneficial. Common choices include a mean-squared-error loss between selected teacher and student hidden states (with a projection layer to bridge differing hidden sizes) and losses that align attention patterns. Incorporating these intermediate losses adds complexity but can significantly improve the student's grasp of nuanced patterns learned by the teacher. The overall loss then becomes a weighted sum of $\mathcal{L}_{CE}$, $\mathcal{L}_{KD}$, and any intermediate matching losses, as sketched below.
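As an illustrative sketch of one such term (the hidden sizes and layer pairing below are assumptions, not values from this guide), a learned linear projection can map student hidden states into the teacher's space before an MSE comparison:

import torch.nn as nn
import torch.nn.functional as F

# Hypothetical hidden sizes: student 2048, teacher 4096
hidden_proj = nn.Linear(2048, 4096)

def hidden_state_loss(student_hidden, teacher_hidden):
    # Both tensors have shape (batch, seq_len, hidden) and come from chosen layers,
    # obtained by calling each model with output_hidden_states=True
    return F.mse_loss(hidden_proj(student_hidden), teacher_hidden)

The projection layer is trained jointly with the student and can be discarded after distillation.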
The training loop needs modification to accommodate the teacher model and the custom loss function. For each input batch, the loop must:

1. Run the student forward pass to obtain its logits (and the cross-entropy loss, if labels are present).
2. Run the teacher forward pass under torch.no_grad() to obtain its logits.
3. Compute the combined loss described above.
4. Backpropagate through the student only and update its parameters.
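A minimal version of such a loop might look as follows; train_dataloader, device, and the optimizer settings are placeholders for this sketch, and padding is not masked in the KD term.

import torch
import torch.nn.functional as F

temperature, alpha = 2.0, 0.5
optimizer = torch.optim.AdamW(student_model.parameters(), lr=5e-5)

student_model.train()
for batch in train_dataloader:  # assumed to yield input_ids, attention_mask, labels
    batch = {k: v.to(device) for k, v in batch.items()}
    # 1. Student forward pass (the model computes the CE loss from the provided labels)
    student_out = student_model(**batch)
    # 2. Teacher forward pass without gradients
    with torch.no_grad():
        teacher_out = teacher_model(**batch)
    # 3. Combined loss: student CE plus temperature-scaled KL to the teacher
    log_p_student = F.log_softmax(student_out.logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_out.logits / temperature, dim=-1)
    loss_kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (temperature ** 2)
    loss = (1.0 - alpha) * student_out.loss + alpha * loss_kd
    # 4. Backpropagate through the student only and update its parameters
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()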
While a custom training loop offers maximum flexibility, the Hugging Face Trainer can be subclassed to incorporate distillation.
# Conceptual subclass of Trainer
from transformers import Trainer
import torch
import torch.nn.functional as F
import torch.nn as nn

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model
        self.teacher_model.to(self.args.device)  # Ensure teacher is on the same device
        self.temperature = temperature
        self.alpha = alpha
        # Potentially add projection layers here if matching hidden states of different sizes

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments passed by newer Trainer versions (e.g., num_items_in_batch)
        # Student forward pass (standard Trainer behavior)
        student_outputs = model(**inputs)
        student_logits = student_outputs.logits

        # Compute standard CE loss if labels are provided
        has_labels = "labels" in inputs
        loss_ce = student_outputs.loss if has_labels else 0.0  # The model computes CE when labels are passed

        # Teacher forward pass (no gradients)
        with torch.no_grad():
            teacher_outputs = self.teacher_model(**inputs)
            teacher_logits = teacher_outputs.logits

        # Compute KD loss (ensure proper slicing/alignment for causal LM);
        # the last position is dropped because it predicts beyond the sequence
        student_log_probs = F.log_softmax(student_logits[:, :-1, :] / self.temperature, dim=-1)
        teacher_probs = F.softmax(teacher_logits[:, :-1, :] / self.temperature, dim=-1)
        # KLDivLoss expects log-probs as input, probs as target
        loss_kd = nn.KLDivLoss(reduction="batchmean")(student_log_probs, teacher_probs) * (self.temperature ** 2)

        # Combine losses: weighted sum when ground-truth labels are available,
        # pure distillation from the teacher's soft labels otherwise
        if has_labels:
            loss = (1.0 - self.alpha) * loss_ce + self.alpha * loss_kd
        else:
            loss = loss_kd

        # Add hidden state matching loss here if applicable
        # loss_hidden = compute_hidden_state_loss(...)
        # loss += beta * loss_hidden

        return (loss, student_outputs) if return_outputs else loss
# Setup Training Arguments
# training_args = TrainingArguments(...)
# Instantiate the DistillationTrainer
# trainer = DistillationTrainer(
# model=student_model,
# teacher_model=teacher_model,
# args=training_args,
# train_dataset=tokenized_dataset["train"],
# eval_dataset=tokenized_dataset["validation"],
# tokenizer=student_tokenizer,
# # data_collator=... # Important for padding and causal LM label shifting
# temperature=2.0,
# alpha=0.5,
# )
# Start training
# trainer.train()
Note: This code is conceptual. Implementing the label shifting for causal LMs and handling padding correctly within the loss calculation requires careful attention to detail.
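One way to handle both concerns, shown here as a hedged sketch rather than a drop-in replacement for the Trainer above: use Hugging Face's causal-LM data collator to pad batches and create the labels, and use the attention mask to average the per-position KL only over real (non-padded) tokens. The masked_kd_loss helper is an illustrative name introduced for this sketch.

from transformers import DataCollatorForLanguageModeling
import torch.nn.functional as F

# Pads batches and copies input_ids into labels (padding positions become -100) for causal LM;
# assumes the tokenizer defines a pad token
data_collator = DataCollatorForLanguageModeling(tokenizer=student_tokenizer, mlm=False)

def masked_kd_loss(student_logits, teacher_logits, attention_mask, temperature=2.0):
    # Per-position KL divergence, summed over the vocabulary dimension
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl_per_position = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    # Average only over non-padded positions, then rescale by T^2
    mask = attention_mask.float()
    return (kl_per_position * mask).sum() / mask.sum() * (temperature ** 2)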
Thorough evaluation is essential to confirm the success of the distillation process. Track quantitative metrics such as perplexity on a held-out set and task-specific scores (e.g., ROUGE for summarization), alongside model size and inference latency.
Figure: comparison of a hypothetical teacher, a distilled student, and a student trained from scratch on a downstream task (e.g., summarization), plotted against model size. The distilled student approaches the teacher's performance with significantly fewer parameters.
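As one simple quantitative check, perplexity on a held-out set can be computed from the average language-modeling loss. The sketch below assumes an eval_dataloader and device are already set up and uses a rough per-batch average (a token-weighted average is more precise).

import math
import torch

@torch.no_grad()
def perplexity(model, dataloader, device):
    # Average the causal-LM loss over the evaluation set and exponentiate
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        total_loss += model(**batch).loss.item()
        num_batches += 1
    return math.exp(total_loss / num_batches)

# e.g., compare perplexity(student_model, eval_dataloader, device) against the teacher's value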
Beyond the numbers, assess the quality of the generated text: fluency, coherence, factual consistency, and adherence to the desired style or domain.
Human evaluation or side-by-side comparisons with the teacher's output are often necessary for a comprehensive assessment.
Always compare the distilled student against relevant baselines: the teacher itself (an upper bound on quality) and a model of the student's size trained directly on the same data without a teacher (which isolates the benefit of distillation).
This hands-on guide provides the foundational steps for distilling generative LLMs. Success requires careful experimentation with architectures, data, loss functions, and hyperparameters, guided by rigorous evaluation across both quantitative and qualitative dimensions. The result, when successful, is a significantly more efficient model suitable for deployment in resource-constrained settings.