While quantization focuses on reducing the precision of the numbers used in a model, network pruning takes a different approach: it aims to eliminate parameters (weights) or even entire structural components deemed unimportant, effectively making the model sparser. The intuition is that large, over-parameterized models often contain significant redundancy, and removing some parts might not drastically impact performance, especially after retraining or fine-tuning. Pruning can lead to substantial reductions in model size and potentially speed up inference by reducing the number of computations.
There are two main categories of pruning: unstructured and structured.
Unstructured pruning operates at the finest granularity level: individual weights within the model's layers. The most common technique is magnitude-based pruning. The core idea is simple: weights with smaller absolute values contribute less to the network's output and are considered less salient.
To perform magnitude pruning, you typically:
1. Choose a target sparsity level (the fraction of weights to remove).
2. Rank the weights within a layer (or across the whole model) by their absolute values.
3. Zero out the weights below the resulting magnitude threshold, usually by applying a binary mask.
4. Fine-tune the remaining weights to recover any lost accuracy.
Here's a PyTorch snippet illustrating the core idea of creating a mask for a single linear layer based on magnitude:
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# Example: A single linear layer
layer = nn.Linear(100, 50)
# --- Magnitude Pruning ---
# Specify the desired sparsity level (e.g., remove 30% of weights)
amount_to_prune = 0.3
# Use PyTorch's pruning utility for unstructured L1 magnitude pruning
prune.l1_unstructured(layer, name="weight", amount=amount_to_prune)
# The pruning is applied via a forward pre-hook: the original values are
# stored in 'weight_orig', a binary mask in 'weight_mask', and 'weight'
# becomes their elementwise product.
# Check the pruned weights (some will be zero)
print(layer.weight)
# To make the pruning permanent (remove the mask and zero out weights directly):
prune.remove(layer, 'weight')
print(layer.weight) # Now contains permanent zeros
# Note: In practice, pruning is often followed by fine-tuning.
# The mask remains during fine-tuning, ensuring pruned weights stay zero.
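In practice, fine-tuning happens before prune.remove is called, while the pruning re-parametrization is still attached to the module. The following is a minimal sketch of that pattern on a fresh layer; the train_loader data, the MSE loss, and the SGD settings are placeholder assumptions chosen only to make the loop runnable.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
prune.l1_unstructured(layer, name="weight", amount=0.3)  # mask stays attached
# Hypothetical fine-tuning data: (input, target) pairs, purely for illustration
train_loader = [(torch.randn(32, 100), torch.randn(32, 50)) for _ in range(5)]
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-3)  # updates weight_orig and bias
loss_fn = nn.MSELoss()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(layer(inputs), targets)
    loss.backward()   # gradients at masked positions are zero
    optimizer.step()
# The forward pre-hook recomputes weight = weight_orig * weight_mask on every
# forward pass, so pruned entries of the effective weight stay exactly zero.
print((layer.weight == 0).float().mean())  # sparsity is still ~0.3
# Only call prune.remove(layer, 'weight') once fine-tuning is finished.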
Advantages of Unstructured Pruning:
- Fine granularity allows very high sparsity levels, often with a smaller accuracy impact than coarser methods at the same sparsity.
- Maximum flexibility: any individual weight in any layer can be removed, so the pruning pattern can closely follow parameter importance.
Disadvantages of Unstructured Pruning:
- The resulting sparsity pattern is irregular, so realizing memory or speed benefits requires sparse storage formats and specialized kernels or hardware support (illustrated in the sketch below).
- Binary masks must be stored and managed throughout training and fine-tuning.
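To make the storage point above concrete, a pruned dense tensor can be converted to a sparse representation such as PyTorch's COO format, which keeps only the non-zero entries. This is a minimal illustration, not a recipe for speedups: whether it saves memory or time in practice depends on the sparsity level and on the available sparse kernels.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
prune.l1_unstructured(layer, name="weight", amount=0.9)  # aggressive sparsity, for illustration only
prune.remove(layer, "weight")
dense_weight = layer.weight.data
sparse_weight = dense_weight.to_sparse()   # COO format: indices + values of non-zeros
print(dense_weight.numel())                # 5000 values stored densely
print(sparse_weight.values().numel())      # 500 non-zero values actually stored
# COO also stores an index pair per non-zero entry, so real memory savings
# only appear at high sparsity, and speedups additionally require kernels
# that can exploit the irregular pattern.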
Instead of removing individual weights, structured pruning removes entire, well-defined blocks or groups of parameters. This could involve removing:
- Entire neurons (rows or columns of a weight matrix),
- Attention heads in transformer layers,
- Channels or filters in convolutional layers,
- Whole layers or blocks.
The criteria for removing structures can vary. They might be based on the aggregate magnitude of the weights within the structure (e.g., the L2 norm of the weights associated with a neuron), the average activation value of a neuron across a dataset, or more complex metrics related to the structure's contribution to the model's output or loss. The snippet below applies the weight-norm criterion to the output neurons of a linear layer; a sketch of an activation-based alternative follows it.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# Example: A linear layer and pruning 'neurons' (output channels)
layer = nn.Linear(100, 50) # 50 output neurons
# --- Structured Pruning (Example: Pruning Neurons/Output Channels) ---
# Let's say we want to prune 10 out of 50 neurons (20%)
num_neurons_to_prune = 10
# Calculate the L2 norm of the weights associated with each output neuron
# layer.weight has shape [out_features, in_features] = [50, 100]
# We calculate the norm along the input dimension (dim=1)
neuron_norms = torch.norm(layer.weight.data, p=2, dim=1)
# (Computing a threshold and index list manually is not needed here:
# prune.ln_structured below computes these per-neuron norms internally
# and selects the lowest-norm rows itself.)
# Use PyTorch's structured pruning utility
# (pruning entire output channels)
# We specify the dimension corresponding to output channels (dim=0)
prune.ln_structured(
layer,
name="weight",
amount=num_neurons_to_prune,
n=2,
dim=0
)
# Again, the pruning is applied via hooks.
# Check the weights - entire rows corresponding to pruned neurons
# will be zero.
# print(layer.weight)
# Make permanent
prune.remove(layer, 'weight')
# print(layer.weight)
# Note: prune.remove only makes the zeros permanent; the weight tensor
# keeps its [50, 100] shape. To obtain a genuinely smaller, dense layer,
# the zeroed rows must be physically removed (e.g., by rebuilding the
# layer with fewer output features), and the input dimension of any
# subsequent layer must be adjusted to match.
# The pruned model then typically needs fine-tuning to recover accuracy.
# This is the practical advantage over unstructured pruning: once the
# zeroed structures are removed, the result is a genuinely smaller model
# that still runs on standard dense operations.
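The snippet above uses a weight-magnitude (L2 norm) criterion. As mentioned earlier, structures can also be scored by their average activation over some calibration data. The sketch below shows one minimal way to do this; the calibration_batches data and the choice of mean absolute activation as the score are illustrative assumptions rather than a fixed recipe.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
layer = nn.Linear(100, 50)
# Hypothetical calibration data: a few batches of inputs with shape [batch, 100]
calibration_batches = [torch.randn(32, 100) for _ in range(10)]
# Accumulate the mean absolute activation of each output neuron
activation_sums = torch.zeros(layer.out_features)
num_samples = 0
with torch.no_grad():
    for batch in calibration_batches:
        out = layer(batch)                       # shape [batch, 50]
        activation_sums += out.abs().sum(dim=0)  # per-neuron sum over the batch
        num_samples += batch.shape[0]
mean_activation = activation_sums / num_samples
# Zero out the rows of the 10 neurons with the lowest average activation,
# mirroring the 20% structured pruning above
num_neurons_to_prune = 10
prune_indices = torch.argsort(mean_activation)[:num_neurons_to_prune]
mask = torch.ones_like(layer.weight)
mask[prune_indices, :] = 0.0
# Apply the custom row mask with PyTorch's custom_from_mask utility
prune.custom_from_mask(layer, name="weight", mask=mask)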
Advantages of Structured Pruning:
- The sparsity pattern is regular: pruned structures can be removed entirely, leaving a smaller, dense model that runs with standard dense operations on commodity hardware.
- Memory and latency reductions are realized directly, without specialized sparse kernels.
Disadvantages of Structured Pruning:
- The granularity is coarser, so the achievable sparsity is typically lower than with unstructured pruning.
- At the same sparsity level, the accuracy impact is usually larger, and removing structures requires adjusting adjacent layers (architectural changes).
The choice between unstructured and structured pruning depends on the specific goals and constraints:
| Feature | Unstructured Pruning | Structured Pruning |
| --- | --- | --- |
| Granularity | Individual weights | Neurons, heads, layers, channels |
| Sparsity pattern | Irregular | Regular (smaller dense tensors/layers) |
| Hardware acceleration | Difficult (requires specialized support) | Easier (uses standard dense operations) |
| Potential sparsity | Higher | Typically lower |
| Implementation | Mask management, sparse kernels | Architectural changes, dense kernels |
| Accuracy impact | Potentially lower (at high sparsity) | Potentially higher (at the same sparsity) |
A critical aspect of nearly all pruning methods is the need for fine-tuning. Simply removing weights or structures usually degrades model performance. To recover accuracy, the pruned model must be retrained (fine-tuned) on the original dataset or a relevant task-specific dataset for some number of epochs. During this fine-tuning phase, the unpruned weights adjust to compensate for the removed components.
Pruning can also be performed iteratively: prune a small percentage of weights, fine-tune, prune again, fine-tune, and so on. This gradual process often yields better results than pruning a large fraction of the model all at once.
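A minimal sketch of such an iterative schedule on a single layer is shown below. Here, fine_tune is a placeholder for a task-specific training loop like the one sketched earlier, and the schedule of five rounds at 10% each is arbitrary. With torch.nn.utils.prune, successive calls combine their masks, so each amount applies to the weights that are still unpruned.
import torch.nn as nn
import torch.nn.utils.prune as prune
def fine_tune(module, steps=100):
    # Placeholder for a task-specific fine-tuning loop (see the earlier sketch)
    pass
layer = nn.Linear(100, 50)
# Iterative magnitude pruning: five rounds, each removing 10% of the remaining weights
for round_idx in range(5):
    prune.l1_unstructured(layer, name="weight", amount=0.10)
    fine_tune(layer)
    sparsity = (layer.weight == 0).float().mean().item()
    print(f"round {round_idx + 1}: sparsity = {sparsity:.2%}")
# Make the final mask permanent once the target sparsity is reached
prune.remove(layer, "weight")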
Network pruning offers a powerful way to reduce the computational footprint of LLMs. While unstructured pruning promises higher compression ratios, its practical benefits often hinge on specialized hardware or software. Structured pruning provides a more direct path to acceleration on standard hardware by creating smaller, dense models, albeit potentially at the cost of lower maximum sparsity. Both approaches typically necessitate careful fine-tuning to restore the model's predictive capabilities.