While quantization focuses on reducing the precision of the numbers used in a model, network pruning takes a different approach: it aims to eliminate parameters (weights) or even entire structural components deemed unimportant, effectively making the model sparser. The intuition is that large, over-parameterized models often contain significant redundancy, and removing some parts might not drastically impact performance, especially after retraining or fine-tuning. Pruning can lead to substantial reductions in model size and potentially speed up inference by reducing the number of computations.
There are two main categories of pruning: unstructured and structured.
Unstructured pruning operates at the finest granularity level: individual weights within the model's layers. The most common technique is magnitude-based pruning. The core idea is simple: weights with smaller absolute values contribute less to the network's output and are considered less salient.
To perform magnitude pruning, you typically:
1. Choose a target sparsity level (the fraction of weights to remove).
2. Rank the weights within a layer (or across the whole model) by their absolute values.
3. Zero out the weights below the resulting magnitude threshold, usually by applying a binary mask.
4. Fine-tune the remaining weights to recover any lost accuracy.
Here's a PyTorch snippet illustrating the core idea of creating a mask for a single linear layer based on magnitude:
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# Example: A single linear layer
layer = nn.Linear(100, 50)
# --- Magnitude Pruning ---
# Specify the desired sparsity level (e.g., remove 30% of weights)
amount_to_prune = 0.3
# Use PyTorch's pruning utility for unstructured L1 magnitude pruning
prune.l1_unstructured(layer, name="weight", amount=amount_to_prune)
# The pruning is applied via a forward pre-hook: the original values are
# stored in 'weight_orig', a binary mask in 'weight_mask', and 'weight'
# becomes their elementwise product.
# Check the pruned weights (some will be zero)
print(layer.weight)
# To make the pruning permanent (remove the mask and zero out weights directly):
prune.remove(layer, 'weight')
print(layer.weight) # Now contains permanent zeros
# Note: In practice, pruning is often followed by fine-tuning.
# The mask remains during fine-tuning, ensuring pruned weights stay zero.
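In practice, fine-tuning happens before prune.remove is called, while the pruning re-parametrization is still attached to the module. The following is a minimal sketch of that pattern on a fresh layer; the train_loader data, the MSE loss, and the SGD settings are placeholder assumptions chosen only to make the loop runnable.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # mask stays attached
# Hypothetical fine-tuning data: (input, target) pairs, purely for illustration
train_loader = [(torch.randn(32, 100), torch.randn(32, 50)) for _ in range(5)]
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-3)  # updates weight_orig and bias
loss_fn = nn.MSELoss()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(layer(inputs), targets)
    loss.backward()   # gradients at masked positions are zero
    optimizer.step()
# The forward pre-hook recomputes weight = weight_orig * weight_mask on every
# forward pass, so pruned entries of the effective weight stay exactly zero.
print((layer.weight == 0).float().mean())  # sparsity is still ~0.3
# Only call prune.remove(layer, 'weight') once fine-tuning is finished.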
Advantages of Unstructured Pruning:
- Fine granularity allows very high sparsity levels, often with a smaller accuracy impact than coarser methods at the same sparsity.
- Maximum flexibility: any individual weight in any layer can be removed, so the pruning pattern can closely follow parameter importance.
Disadvantages of Unstructured Pruning:
- The resulting sparsity pattern is irregular, so realizing memory or speed benefits requires sparse storage formats and specialized kernels or hardware support (illustrated in the sketch below).
- Binary masks must be stored and managed throughout training and fine-tuning.
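To make the storage point above concrete, a pruned dense tensor can be converted to a sparse representation such as PyTorch's COO format, which keeps only the non-zero entries. This is a minimal illustration, not a recipe for speedups: whether it saves memory or time in practice depends on the sparsity level and on the available sparse kernels.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
prune.l1_unstructured(layer, name="weight", amount=0.9)  # aggressive sparsity, for illustration only
prune.remove(layer, "weight")
dense_weight = layer.weight.data
sparse_weight = dense_weight.to_sparse()   # COO format: indices + values of non-zeros
print(dense_weight.numel())                # 5000 values stored densely
print(sparse_weight.values().numel())      # 500 non-zero values actually stored
# COO also stores an index pair per non-zero entry, so real memory savings
# only appear at high sparsity, and speedups additionally require kernels
# that can exploit the irregular pattern.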
Instead of removing individual weights, structured pruning removes entire, well-defined blocks or groups of parameters. This could involve removing:
- Entire neurons (rows or columns of a weight matrix),
- Attention heads in transformer layers,
- Channels or filters in convolutional layers,
- Whole layers or blocks.
The criteria for removing structures can vary. They might be based on the aggregate magnitude of the weights within the structure (e.g., the L2 norm of the weights associated with a neuron), the average activation value of a neuron across a dataset, or more complex metrics related to the structure's contribution to the model's output or loss. The snippet below applies the weight-norm criterion to the output neurons of a linear layer; a sketch of an activation-based alternative follows it.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# Example: A linear layer and pruning 'neurons' (output channels)
layer = nn.Linear(100, 50) # 50 output neurons
# --- Structured Pruning (Example: Pruning Neurons/Output Channels) ---
# Let's say we want to prune 10 out of 50 neurons (20%)
num_neurons_to_prune = 10
# Calculate the L2 norm of the weights associated with each output neuron
# layer.weight has shape [out_features, in_features] = [50, 100]
# We calculate the norm along the input dimension (dim=1)
neuron_norms = torch.norm(layer.weight.data, p=2, dim=1)
# (Computing a threshold and index list manually is not needed here:
# prune.ln_structured below computes these per-neuron norms internally
# and selects the lowest-norm rows itself.)
# Use PyTorch's structured pruning utility
# (pruning entire output channels)
# We specify the dimension corresponding to output channels (dim=0)
prune.ln_structured(
layer,
name="weight",
amount=num_neurons_to_prune,
n=2,
dim=0
)
# Again, the pruning is applied via hooks.
# Check the weights - entire rows corresponding to pruned neurons
# will be zero.
# print(layer.weight)
# Make permanent
prune.remove(layer, 'weight')
# print(layer.weight)
# Note: prune.remove only makes the zeros permanent; the weight tensor
# keeps its [50, 100] shape. To obtain a genuinely smaller, dense layer,
# the zeroed rows must be physically removed (e.g., by rebuilding the
# layer with fewer output features), and the input dimension of any
# subsequent layer must be adjusted to match.
# The pruned model then typically needs fine-tuning to recover accuracy.
# This is the practical advantage over unstructured pruning: once the
# zeroed structures are removed, the result is a genuinely smaller model
# that still runs on standard dense operations.
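The snippet above uses a weight-magnitude (L2 norm) criterion. As mentioned earlier, structures can also be scored by their average activation over some calibration data. The sketch below shows one minimal way to do this; the calibration_batches data and the choice of mean absolute activation as the score are illustrative assumptions rather than a fixed recipe.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
# Hypothetical calibration data: a few batches of inputs with shape [batch, 100]
calibration_batches = [torch.randn(32, 100) for _ in range(10)]
# Accumulate the mean absolute activation of each output neuron
activation_sums = torch.zeros(layer.out_features)
num_samples = 0
with torch.no_grad():
    for batch in calibration_batches:
        out = layer(batch)                       # shape [batch, 50]
        activation_sums += out.abs().sum(dim=0)  # per-neuron sum over the batch
        num_samples += batch.shape[0]
mean_activation = activation_sums / num_samples
# Zero out the rows of the 10 neurons with the lowest average activation,
# mirroring the 20% structured pruning above
num_neurons_to_prune = 10
prune_indices = torch.argsort(mean_activation)[:num_neurons_to_prune]
mask = torch.ones_like(layer.weight)
mask[prune_indices, :] = 0.0
# Apply the custom row mask with PyTorch's custom_from_mask utility
prune.custom_from_mask(layer, name="weight", mask=mask)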
Advantages of Structured Pruning:
- The sparsity pattern is regular: pruned structures can be removed entirely, leaving a smaller, dense model that runs with standard dense operations on commodity hardware.
- Memory and latency reductions are realized directly, without specialized sparse kernels.
Disadvantages of Structured Pruning:
- The granularity is coarser, so the achievable sparsity is typically lower than with unstructured pruning.
- At the same sparsity level, the accuracy impact is usually larger, and removing structures requires adjusting adjacent layers (architectural changes).
The choice between unstructured and structured pruning depends on the specific goals and constraints:
| Feature | Unstructured Pruning | Structured Pruning |
| --- | --- | --- |
| Granularity | Individual weights | Neurons, heads, layers, channels |
| Sparsity pattern | Irregular | Regular (smaller dense tensors/layers) |
| Hardware acceleration | Difficult (requires specialized support) | Easier (uses standard dense operations) |
| Potential sparsity | Higher | Typically lower |
| Implementation | Mask management, sparse kernels | Architectural changes, dense kernels |
| Accuracy impact | Potentially lower (at high sparsity) | Potentially higher (at the same sparsity) |
A critical aspect of nearly all pruning methods is the need for fine-tuning. Simply removing weights or structures usually degrades model performance. To recover accuracy, the pruned model must be retrained (fine-tuned) on the original dataset or a relevant task-specific dataset for some number of epochs. During this fine-tuning phase, the unpruned weights adjust to compensate for the removed components.
Pruning can also be performed iteratively: prune a small percentage of weights, fine-tune, prune again, fine-tune, and so on. This gradual process often yields better results than pruning a large fraction of the model all at once.
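A minimal sketch of such an iterative schedule on a single layer is shown below. Here, fine_tune is a placeholder for a task-specific training loop like the one sketched earlier, and the schedule of five rounds at 10% each is arbitrary. With torch.nn.utils.prune, successive calls combine their masks, so each amount applies to the weights that are still unpruned.
import torch.nn as nn
import torch.nn.utils.prune as prune
def fine_tune(module, steps=100):
    # Placeholder for a task-specific fine-tuning loop (see the earlier sketch)
    pass
layer = nn.Linear(100, 50)
# Iterative magnitude pruning: five rounds, each removing 10% of the remaining weights
for round_idx in range(5):
    prune.l1_unstructured(layer, name="weight", amount=0.10)
    fine_tune(layer)
    sparsity = (layer.weight == 0).float().mean().item()
    print(f"round {round_idx + 1}: sparsity = {sparsity:.2%}")
# Make the final mask permanent once the target sparsity is reached
prune.remove(layer, "weight")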
Network pruning offers a powerful way to reduce the computational footprint of LLMs. While unstructured pruning promises higher compression ratios, its practical benefits often hinge on specialized hardware or software. Structured pruning provides a more direct path to acceleration on standard hardware by creating smaller, dense models, albeit potentially at the cost of lower maximum sparsity. Both approaches typically necessitate careful fine-tuning to restore the model's predictive capabilities.