While unstructured pruning offers maximum flexibility by removing individual weights, its irregular sparsity patterns often require specialized hardware or software support (like sparse matrix libraries) to translate into actual inference speedups. Structured pruning takes a different approach by removing entire, predefined groups of parameters. This creates more regular sparsity patterns that standard hardware like GPUs and TPUs can often handle more efficiently, leading to direct reductions in computation and memory bandwidth requirements without needing specialized kernels.
The core idea is to identify and eliminate larger structural components of the network that contribute least to the overall model performance. For Transformer-based LLMs, several structured pruning techniques are particularly relevant:
The Feed-Forward Network (FFN) sub-layers within each Transformer block typically consist of two linear transformations with an activation function in between. Neuron pruning targets the intermediate layer of this FFN, removing entire rows from the first weight matrix and corresponding columns from the second weight matrix. This effectively removes specific 'neurons' or dimensions from the intermediate representation.
Consider an FFN defined as:
$$\text{FFN}(x) = \text{Linear}_2(\text{Activation}(\text{Linear}_1(x)))$$

where $x$ is the input, $\text{Linear}_1$ maps from dimension $d_{\text{model}}$ to $d_{\text{ffn}}$, and $\text{Linear}_2$ maps from $d_{\text{ffn}}$ back to $d_{\text{model}}$. Pruning a neuron in the intermediate layer means reducing the dimension $d_{\text{ffn}}$. If we remove the $k$-th neuron, we remove the $k$-th row of the weight matrix in $\text{Linear}_1$ and the $k$-th column of the weight matrix in $\text{Linear}_2$.
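To make this concrete, here is a minimal PyTorch sketch. The helper name `prune_ffn_neurons` is hypothetical, and the FFN is assumed to be two plain `nn.Linear` layers as in the equation above: pruning the $k$-th neuron amounts to keeping every row of the first weight matrix and every column of the second except row/column $k$.

```python
import torch
import torch.nn as nn

def prune_ffn_neurons(linear1: nn.Linear, linear2: nn.Linear, keep_idx: torch.Tensor):
    """Return smaller copies of Linear1/Linear2 that keep only the selected neurons."""
    new_d_ffn = keep_idx.numel()
    new_linear1 = nn.Linear(linear1.in_features, new_d_ffn, bias=linear1.bias is not None)
    new_linear2 = nn.Linear(new_d_ffn, linear2.out_features, bias=linear2.bias is not None)
    with torch.no_grad():
        # Each kept neuron corresponds to one row of W1 (its incoming weights)...
        new_linear1.weight.copy_(linear1.weight[keep_idx, :])
        if linear1.bias is not None:
            new_linear1.bias.copy_(linear1.bias[keep_idx])
        # ...and one column of W2 (its outgoing weights). W2's bias is unaffected.
        new_linear2.weight.copy_(linear2.weight[:, keep_idx])
        if linear2.bias is not None:
            new_linear2.bias.copy_(linear2.bias)
    return new_linear1, new_linear2

# Toy example: remove neuron k = 3 from an FFN with d_model = 8, d_ffn = 16.
linear1, linear2 = nn.Linear(8, 16), nn.Linear(16, 8)
keep = torch.tensor([i for i in range(16) if i != 3])
small1, small2 = prune_ffn_neurons(linear1, linear2, keep)
```

Because the result is simply a pair of smaller dense layers, no sparse formats or masks are needed at inference time.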
Importance Criteria: How do we decide which neurons to prune? Common criteria include the magnitude of a neuron's weights (for example, the norm of its row in the first weight matrix, its column in the second, or both combined), activation statistics gathered on a small calibration set (neurons whose activations are consistently near zero contribute little), and gradient-based or Taylor-expansion scores that estimate how much the loss would increase if the neuron were removed.
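The simplest of these, a weight-magnitude score, might look like the sketch below (the helper `neuron_importance` is a hypothetical name; activation- or gradient-based scores would replace the norm computation with statistics gathered from forward or backward passes on calibration data).

```python
import torch
import torch.nn as nn

def neuron_importance(linear1: nn.Linear, linear2: nn.Linear) -> torch.Tensor:
    """Magnitude-based score: combine the norm of each neuron's incoming weights
    (a row of W1) with the norm of its outgoing weights (a column of W2)."""
    in_norm = linear1.weight.norm(dim=1)    # shape (d_ffn,)
    out_norm = linear2.weight.norm(dim=0)   # shape (d_ffn,)
    return in_norm * out_norm

linear1, linear2 = nn.Linear(8, 16), nn.Linear(16, 8)
scores = neuron_importance(linear1, linear2)
# Keep the 75% highest-scoring neurons, i.e. 25% structured sparsity in this FFN.
keep = torch.topk(scores, k=int(0.75 * scores.numel())).indices.sort().values
```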
Pruning neurons directly reduces the width (dffn) of the FFN layers, shrinking the weight matrices and decreasing the number of floating-point operations (FLOPs) required during inference.
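As a rough illustration of the savings (using illustrative dimensions, not any specific model, and counting a multiply-accumulate as two FLOPs):

```python
# Each linear layer of a two-matrix FFN costs about 2 * d_model * width FLOPs per token,
# and there are two such layers.
d_model, d_ffn = 4096, 11008                      # illustrative dimensions only
ffn_flops = lambda width: 2 * (2 * d_model * width)
print(ffn_flops(d_ffn))                           # ~1.8e8 FLOPs per token
print(ffn_flops(int(0.75 * d_ffn)))               # pruning 25% of neurons removes ~25% of this
```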
Multi-Head Attention (MHA) allows the model to jointly attend to information from different representation subspaces at different positions. Attention head pruning involves removing entire attention heads from the MHA layers.
Recall that in MHA, the input queries (Q), keys (K), and values (V) are linearly projected h times (where h is the number of heads) using different learned projection matrices for each head. The attention mechanism is applied independently for each head, and the results are concatenated and projected back to the model dimension.
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O \quad \text{where} \quad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

Pruning the $j$-th attention head means removing the corresponding projection matrices $W_j^Q$, $W_j^K$, $W_j^V$ and the corresponding segment of the output projection matrix $W^O$.
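The sketch below shows the mechanics under one common implementation layout, where separate `nn.Linear` layers project $d_{\text{model}}$ to `num_heads * head_dim` for Q, K, and V, and the output projection maps `num_heads * head_dim` back to $d_{\text{model}}$. The helper names are hypothetical and real model classes also need their head-count metadata updated.

```python
import torch
import torch.nn as nn

def prune_heads(q_proj, k_proj, v_proj, o_proj, keep_heads, head_dim):
    """Return smaller Q/K/V/O projections that keep only the selected heads."""
    # Flatten the kept head ids into indices along the (num_heads * head_dim) axis.
    idx = torch.tensor([h * head_dim + d for h in keep_heads for d in range(head_dim)])
    d_model = q_proj.in_features

    def shrink_rows(proj):  # Q/K/V: each head owns head_dim output rows of the weight.
        new = nn.Linear(d_model, idx.numel(), bias=proj.bias is not None)
        with torch.no_grad():
            new.weight.copy_(proj.weight[idx, :])
            if proj.bias is not None:
                new.bias.copy_(proj.bias[idx])
        return new

    new_o = nn.Linear(idx.numel(), o_proj.out_features, bias=o_proj.bias is not None)
    with torch.no_grad():   # W_O loses the matching input columns instead.
        new_o.weight.copy_(o_proj.weight[:, idx])
        if o_proj.bias is not None:
            new_o.bias.copy_(o_proj.bias)
    return shrink_rows(q_proj), shrink_rows(k_proj), shrink_rows(v_proj), new_o

# Toy example: d_model = 512, 8 heads of dimension 64; prune head j = 7.
q, k, v = (nn.Linear(512, 512) for _ in range(3))
o = nn.Linear(512, 512)
q, k, v, o = prune_heads(q, k, v, o, keep_heads=[h for h in range(8) if h != 7], head_dim=64)
```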
Importance Criteria: Selecting heads for pruning often involves more sophisticated metrics than simple weight norms. Common choices include the sensitivity of the loss to masking a head (estimated from gradients with respect to a per-head gate), learned gating variables trained with a sparsity-inducing penalty, and statistics of the attention distributions themselves, such as how confident (far from uniform) a head's attention weights are on a calibration set.
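A self-contained example of an attention-statistics criterion, the per-head "confidence" (average maximum attention weight over a calibration set), is sketched below; gradient-based scores would instead backpropagate through per-head gates. The helper name is hypothetical.

```python
import torch

def head_confidence(attn_probs: torch.Tensor) -> torch.Tensor:
    """Average maximum attention weight per head ("confidence").

    attn_probs: (batch, num_heads, seq_len, seq_len) attention probabilities
    collected on a calibration set. Heads whose attention is close to uniform
    score low and are frequent pruning candidates.
    """
    return attn_probs.max(dim=-1).values.mean(dim=(0, 2))   # shape (num_heads,)

# Toy example: 2 sequences, 8 heads, sequence length 16.
probs = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
scores = head_confidence(probs)
least_important = torch.argsort(scores)[:2]   # two lowest-confidence heads
```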
Removing attention heads directly reduces the computational cost of the attention mechanism, as fewer independent attention computations are performed, and the subsequent concatenation and output projection involves smaller matrices.
Figure: Comparison between irregular sparsity from unstructured pruning and regular sparsity resulting from removing entire neurons or attention heads in structured pruning. Pruned components are shown in gray.
While neuron and head pruning are the most common for Transformers, other granularities exist: entire Transformer layers (blocks) can be dropped when their contribution is small, the hidden dimension can be reduced by pruning the same channels consistently across all layers, and coarse block-sparse patterns can be applied within individual weight matrices. Layer pruning in particular is mechanically simple, as sketched below.
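A minimal sketch of layer pruning, rebuilding the block list with only the kept layers (the `drop_layers` helper is hypothetical, and real models also need their layer-count metadata updated):

```python
import torch.nn as nn

def drop_layers(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    """Layer pruning: keep only the Transformer blocks whose indices are in `keep`."""
    return nn.ModuleList(layers[i] for i in sorted(keep))

# Toy example: a 12-block stack reduced to 9 blocks by dropping layers 3, 7, and 11.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(12)
)
blocks = drop_layers(blocks, [i for i in range(12) if i not in (3, 7, 11)])
```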
Structured pruning is often implemented iteratively: importance scores are computed for the candidate structures (heads, neurons, or layers), a small fraction of the least important ones is removed, the model is fine-tuned to recover accuracy, and the cycle repeats until the target size or latency is reached, as in the skeleton below.
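This skeleton only fixes the control flow; `score_fn`, `prune_fn`, and `train_step` are hypothetical placeholders for the scoring, surgery, and training code discussed above.

```python
def iterative_structured_pruning(model, score_fn, prune_fn, train_step,
                                 rounds=4, frac_per_round=0.1, finetune_steps=1000):
    """Skeleton of the prune-then-finetune cycle.

    score_fn ranks structures (heads/neurons/layers), prune_fn removes the
    lowest-scoring fraction, and train_step performs one fine-tuning update.
    """
    for _ in range(rounds):
        scores = score_fn(model)                          # 1. estimate importance
        model = prune_fn(model, scores, frac_per_round)   # 2. remove least important
        for _ in range(finetune_steps):                   # 3. recover accuracy
            train_step(model)
    return model
```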
Fine-tuning is an important step: removing large chunks of the model can initially cause a noticeable drop in performance, and fine-tuning allows the remaining parameters to adapt and compensate for the removed components.
The primary advantage of structured pruning lies in its direct mapping to hardware efficiency. Removing an entire attention head or FFN neuron results in smaller dense matrix multiplications and reduced memory requirements. Standard deep learning libraries and hardware accelerators can execute these smaller dense operations efficiently without needing specialized sparse computation support. This often translates to measurable reductions in latency and memory footprint, making it attractive for deploying LLMs in resource-constrained environments.
Structured pruning forces choices at a coarser granularity than unstructured pruning. While it yields more hardware-friendly models, it might remove some parameters that unstructured pruning would have kept, potentially leading to a slightly larger accuracy drop for the same parameter count reduction. The choice between unstructured and structured pruning often depends on the target hardware, the required level of compression, and the acceptable performance degradation. Combining structured pruning with other techniques like quantization can further enhance efficiency, a topic explored in the next section.