As introduced earlier, deploying large CNNs often demands strategies to make them more compact and computationally cheaper. Network pruning is a widely used technique that directly addresses this by removing redundant parameters or structures from a trained network, aiming to reduce its size and inference cost with minimal impact on accuracy. The underlying principle is that many deep learning models are significantly over-parameterized, containing numerous weights or even entire filters that contribute little to the final prediction.
The Rationale Behind Pruning
Deep neural networks often achieve high performance partly due to their large number of parameters. However, research and practice have shown that many of these parameters can be removed after training without substantial loss in accuracy. This redundancy might arise from the optimization process or the initial network design. Pruning exploits this redundancy by identifying and eliminating the least important components. The goal is to find a smaller sub-network within the original trained network that performs almost as well.
Types of Network Pruning
Pruning techniques can be broadly categorized based on the granularity of the elements being removed:
Unstructured Pruning (Weight Pruning)
This is the most granular form of pruning. Individual weights within the network's layers (convolutional or fully connected) are scored by an importance criterion, typically their absolute magnitude, and those falling below a chosen threshold are set to zero.
- Process:
- Train a dense network to convergence.
- Rank individual weights based on their absolute magnitude (or other criteria).
- Set a target sparsity level (e.g., remove 80% of weights).
- Zero out the weights with the lowest magnitudes to meet the target sparsity.
- Fine-tune the remaining non-zero weights to recover accuracy lost during pruning.
- Characteristics:
- Results in sparse weight matrices with zero values scattered irregularly.
- Can achieve very high compression ratios (removing a large percentage of weights).
- Requires specialized hardware or software libraries (e.g., sparse matrix multiplication routines) to realize significant inference speedups, as standard hardware often doesn't accelerate computations with arbitrary sparsity patterns efficiently.
- The fine-tuning step is important to regain performance.
The fine-tuning process helps the network adapt to the removal of weights. Often, this involves retraining the network with the pruned weights fixed at zero, allowing only the remaining weights to adjust.
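To make these steps concrete, below is a minimal sketch of magnitude-based weight pruning written with plain PyTorch tensor operations (PyTorch also ships helpers in torch.nn.utils.prune, but the manual version exposes the mechanics). The magnitude_prune name and the 80% sparsity target are illustrative choices, not a fixed recipe.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Module, sparsity: float) -> torch.Tensor:
    """Zero out the lowest-magnitude weights of a layer and return the binary mask."""
    weight = layer.weight.data
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight)
    # Magnitude threshold: the num_prune-th smallest absolute weight value.
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    mask = (weight.abs() > threshold).float()
    layer.weight.data.mul_(mask)   # zero out the pruned weights in place
    return mask                    # keep the mask to freeze zeros during fine-tuning

# Example: prune 80% of a convolutional layer's weights.
conv = nn.Conv2d(64, 128, kernel_size=3)
mask = magnitude_prune(conv, sparsity=0.8)
print(f"non-zero weights remaining: {int(mask.sum())} / {mask.numel()}")
```

During fine-tuning, the returned mask would be multiplied back into the weights after every optimizer step so that pruned positions stay at zero.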
Structured Pruning
Instead of removing individual weights, structured pruning removes entire structural elements of the network. This makes the resulting network smaller and inherently faster on standard hardware without needing specialized sparse computation libraries. Common forms include:
- Filter Pruning: Entire filters (and their corresponding output feature maps) are removed from convolutional layers. If a filter is pruned from layer i, the corresponding input channel to layer i+1 is also effectively removed.
- Channel Pruning: Closely related to filter pruning; entire feature-map channels are removed, together with the filter weights that produce and consume them.
- Neuron Pruning: Entire neurons are removed from fully connected layers, along with all of their incoming and outgoing weights (a full row or column of the relevant weight matrices).
- Process:
- Train a dense network.
- Calculate an importance score for each structure (e.g., filter, channel). Common criteria include the L1 or L2 norm of the filter's weights, the mean activation value produced by the filter across a dataset, or gradient-based measures.
- Rank the structures based on their importance scores.
- Remove the lowest-scoring structures to meet a target reduction level.
- Fine-tune the smaller, pruned network.
- Characteristics:
- Results in a smaller, dense network (or layers with fewer filters/neurons).
- Directly translates to reduced computation (FLOPs) and memory usage on standard hardware (CPUs, GPUs).
- May not achieve the same theoretical compression ratios as unstructured pruning but often yields better practical speedups.
- Fine-tuning remains essential.
Figure: Comparison between unstructured (weight) pruning, which yields a sparse weight matrix, and structured (filter) pruning, which yields a smaller, dense layer.
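As an illustration of filter pruning, the sketch below scores each filter of a Conv2d by the L1 norm of its weights and rebuilds a smaller, dense layer from the highest-scoring filters; prune_conv_filters and the 50% keep ratio are assumptions made for this example.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float):
    """Return a new, smaller Conv2d keeping only the filters with the largest L1 norms."""
    num_keep = max(1, int(conv.out_channels * keep_ratio))
    # Importance score: L1 norm of each filter, summed over (in_channels, kH, kW).
    importance = conv.weight.data.abs().sum(dim=(1, 2, 3))
    keep_idx = torch.argsort(importance, descending=True)[:num_keep]

    pruned = nn.Conv2d(conv.in_channels, num_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep_idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep_idx].clone()
    return pruned, keep_idx   # keep_idx is needed to slice the next layer's input channels

# Example: keep the top 50% of filters in a convolutional layer.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
smaller_conv, kept = prune_conv_filters(conv, keep_ratio=0.5)
print(smaller_conv)   # a Conv2d with 64 output channels instead of 128
```

Because layer i's output feeds layer i+1, the next layer's weight tensor must also be sliced along its input-channel dimension with the same keep_idx (and any intermediate batch normalization re-indexed) before the pruned network can run.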
The Pruning Workflow and Criteria
A typical pruning workflow involves iteratively pruning and fine-tuning:
- Train: Train the original, large model until convergence.
- Prune: Select a pruning method (unstructured/structured) and criterion (e.g., magnitude, norm). Remove a portion of the network components based on this criterion.
- Fine-tune: Retrain the pruned network for a number of epochs to allow the remaining parameters to adjust and recover accuracy.
- Iterate: Repeat steps 2 and 3 until the desired level of sparsity or size reduction is achieved, or until accuracy drops below an acceptable threshold.
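A hedged sketch of this loop is shown below, reusing magnitude pruning on a toy model with random data standing in for a real training set; prune_step, fine_tune, and the 0.3/0.5/0.7 sparsity schedule are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn

def prune_step(model, sparsity):
    """Zero the lowest-magnitude weights in every Conv2d/Linear layer; return masks."""
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            w = m.weight.data
            k = max(1, int(w.numel() * sparsity))
            thresh = w.abs().flatten().kthvalue(k).values
            masks[name] = (w.abs() > thresh).float()
            w.mul_(masks[name])
    return masks

def fine_tune(model, masks, data, targets, epochs=2):
    """Brief retraining; pruned weights are forced back to zero after every update."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
        for name, m in model.named_modules():
            if name in masks:
                m.weight.data.mul_(masks[name])   # keep pruned weights at zero

# Toy setup: iterate prune -> fine-tune with an increasing sparsity schedule.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32), nn.ReLU(), nn.Linear(32, 10))
data, targets = torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,))
for sparsity in [0.3, 0.5, 0.7]:
    masks = prune_step(model, sparsity)
    fine_tune(model, masks, data, targets)
```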
Choosing the right criterion for identifying unimportant components is a key design decision:
- Magnitude-based: The simplest and often effective method. Assumes parameters with small magnitudes contribute less. For structured pruning, the L1 or L2 norm of the weights within a filter/channel is commonly used.
- Gradient-based: Uses gradient information during training to estimate parameter importance. Parameters whose removal causes the smallest increase in the loss function might be considered less important.
- Activation-based: Analyzes the output activations generated by neurons or filters. For example, filters that consistently produce near-zero activations across many inputs might be pruned.
- Other Criteria: More advanced methods might involve analyzing the Hessian matrix of the loss function or using sensitivity analysis.
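As one example of the activation-based idea, the sketch below attaches a forward hook to a convolutional layer and averages the absolute activation each filter produces over a batch of inputs; activation_importance, the toy model, and the random data are stand-ins, and a real analysis would typically measure post-nonlinearity activations over a validation set.

```python
import torch
import torch.nn as nn

def activation_importance(model: nn.Module, layer: nn.Conv2d, loader) -> torch.Tensor:
    """Score each filter of `layer` by its mean absolute activation over a dataset."""
    sums = torch.zeros(layer.out_channels)
    count = 0

    def hook(module, inputs, output):
        nonlocal count
        # output: (batch, channels, H, W) -> mean |activation| per channel (pre-nonlinearity here)
        sums.add_(output.detach().abs().mean(dim=(0, 2, 3)))
        count += 1

    handle = layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        for images in loader:
            model(images)
    handle.remove()
    return sums / max(count, 1)   # low scores suggest filters that rarely activate

# Toy usage with random inputs standing in for a real dataset.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 32, 3, padding=1))
loader = [torch.randn(8, 3, 32, 32) for _ in range(4)]
scores = activation_importance(model, model[0], loader)
print("lowest-scoring filters:", torch.argsort(scores)[:4].tolist())
```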
Considerations and Trade-offs
- Sparsity Level: Determining how much to prune is often empirical. Pruning too aggressively can irrecoverably damage performance, while pruning too little yields minimal benefits. Iterative pruning allows for gradual reduction.
- Accuracy Recovery: Fine-tuning is almost always necessary. The learning rate and duration of fine-tuning need careful adjustment. Sometimes, accuracy can even slightly improve after pruning due to regularization effects, but more commonly, there's a small drop.
- Hardware/Library Support: As mentioned, the practical speedup from unstructured pruning heavily depends on the availability of optimized libraries or hardware accelerators that can handle sparse computations efficiently. Structured pruning generally provides more predictable speedups on commodity hardware.
- Layer Sensitivity: Different layers in a network might have varying sensitivity to pruning. Early layers often learn general features and might be more sensitive than deeper layers. Some strategies involve applying different pruning ratios to different layers.
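Layer sensitivity can be probed empirically, as in the sketch below: prune one layer at a time (magnitude-based here) at several ratios and record how accuracy degrades on held-out data. The model, data, and helper names (evaluate, layer_sensitivity) are placeholders for a trained network and a real validation set.

```python
import copy
import torch
import torch.nn as nn

def evaluate(model: nn.Module, data: torch.Tensor, targets: torch.Tensor) -> float:
    """Top-1 accuracy on a held-out batch standing in for a validation set."""
    with torch.no_grad():
        return (model(data).argmax(dim=1) == targets).float().mean().item()

def layer_sensitivity(model, data, targets, ratios=(0.25, 0.5, 0.75)):
    """Prune one layer at a time (magnitude-based) and record accuracy at each ratio."""
    report = {}
    for name, module in model.named_modules():
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue
        report[name] = []
        for ratio in ratios:
            probe = copy.deepcopy(model)               # prune a copy, keep the original intact
            w = dict(probe.named_modules())[name].weight.data
            k = max(1, int(w.numel() * ratio))
            thresh = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > thresh).float())
            report[name].append((ratio, evaluate(probe, data, targets)))
    return report

# Toy model and data; a real study would use the trained network and a validation loader.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(8 * 16 * 16, 10))
data, targets = torch.randn(32, 3, 16, 16), torch.randint(0, 10, (32,))
for layer, curve in layer_sensitivity(model, data, targets).items():
    print(layer, [(r, round(acc, 3)) for r, acc in curve])
```

Layers whose accuracy curve falls off quickly are candidates for a lower pruning ratio.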
Network pruning offers a powerful set of techniques for reducing the complexity of deep learning models. By carefully removing redundant weights or structures and fine-tuning the remaining network, significant reductions in model size and computational requirements can be achieved, making models more suitable for deployment in constrained environments. The choice between unstructured and structured pruning often depends on the specific hardware target and the desired trade-off between compression ratio and ease of achieving practical speedups.