As we've established, the standard Transformer architecture, especially when scaled to large sizes, demands substantial computational resources and memory. The quadratic complexity of self-attention is one bottleneck, addressed by techniques like sparse or linear attention. However, another significant factor contributing to resource requirements is the sheer number of parameters, particularly in deep models with many layers. This section focuses on techniques designed to reduce the parameter count, making models smaller, potentially faster to train (per epoch, due to less data movement), and easier to deploy, often with minimal impact on performance.
The central idea revolves around parameter sharing, where the same set of weights is used in multiple parts of the network. This contrasts with standard deep networks where each layer typically has its own unique set of parameters.
In many Natural Language Processing (NLP) tasks, the vocabulary size (V) can be very large (tens or hundreds of thousands). The input embedding layer maps these discrete tokens to dense vectors of hidden size H. This results in an embedding matrix of size V×H. In large models like BERT, the hidden size H is also large (e.g., 768 or 1024). Consequently, the embedding matrix constitutes a significant portion of the total parameters. For instance, with V=50,000 and H=1024, the embedding matrix alone has over 50 million parameters.
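As a quick sanity check of that figure, the short PyTorch sketch below (using the V=50,000 and H=1024 values from above) counts the parameters of a standard, unfactorized embedding table:

```python
import torch.nn as nn

# Standard V x H embedding table with the sizes quoted in the text.
V, H = 50_000, 1024
embedding = nn.Embedding(V, H)
num_params = sum(p.numel() for p in embedding.parameters())
print(f"{num_params:,}")  # 51,200,000 parameters
```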
The core insight behind factorized embedding parameterization, prominently featured in models like ALBERT (A Lite BERT for Self-supervised Learning of Language Representations), is that the embeddings serve two purposes: capturing context-independent token representations and projecting these into the hidden space of the Transformer layers. ALBERT argues that the context-independent representation might not need the full dimensionality H.
Instead of directly learning a single V×H matrix, this technique factors it into two smaller matrices: a V×E matrix that maps each token to a lower-dimensional embedding of size E, and an E×H matrix that projects those embeddings into the Transformer's hidden space.
The total number of parameters for the embedding becomes V×E+E×H. Compared to the original V×H, this can lead to substantial savings if E is chosen significantly smaller than H. For example, if V=50,000, H=1024, and we choose E=128, the parameters change from 50,000×1024≈51.2M to (50,000×128)+(128×1024)=6.4M+0.13M≈6.53M. This is nearly an 8x reduction in embedding parameters.
This factorization decouples the vocabulary size from the hidden size of the Transformer, allowing for large vocabularies without disproportionately increasing the model size due to embeddings. It implicitly assumes that the semantic richness of individual words can be captured in a lower-dimensional space (E) before being projected into the higher-dimensional space (H) used for context-dependent processing within the Transformer layers.
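A minimal PyTorch sketch of this factorization is shown below. The class name and the bias-free projection are illustrative choices rather than ALBERT's exact implementation, but the parameter count matches the arithmetic above:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Token embeddings factorized through a low-dimensional bottleneck E,
    in the spirit of ALBERT's factorized embedding parameterization."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)      # V x E
        self.projection = nn.Linear(embed_dim, hidden_dim, bias=False)  # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Look up the E-dimensional embedding, then project it into the
        # H-dimensional hidden space used by the Transformer layers.
        return self.projection(self.token_embedding(token_ids))

factorized = FactorizedEmbedding(vocab_size=50_000, embed_dim=128, hidden_dim=1024)
num_params = sum(p.numel() for p in factorized.parameters())
print(f"{num_params:,}")  # 6,531,072 -- roughly the 6.53M quoted above
```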
Another effective technique, also central to ALBERT, is sharing parameters across the Transformer layers. In a standard Transformer encoder stack with L layers, each layer typically has its own unique weight matrices for the multi-head self-attention (Q, K, V projections, output projection) and the position-wise feed-forward network (FFN). This means the parameters scale linearly with the number of layers.
Cross-layer parameter sharing breaks this linear scaling. Instead of learning L distinct sets of layer parameters, a single set is learned and reused across all L layers. Common variants include sharing only the attention parameters, sharing only the feed-forward parameters, and sharing all parameters across every layer.
Sharing all parameters, as done in ALBERT, dramatically reduces the number of parameters in the Transformer layers. If a single layer has P_layer parameters, a standard L-layer model has L×P_layer parameters across its layers, whereas a fully shared model has only P_layer parameters for all layers combined (excluding embeddings and normalization layers, which may or may not be shared).
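The idea can be sketched in a few lines of PyTorch. The SharedEncoder class below is a simplified illustration that uses nn.TransformerEncoderLayer (with its default feed-forward size) as a stand-in for ALBERT's actual layer; it is not a faithful ALBERT implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """A depth-L encoder in which every 'layer' reuses the same weights
    (full cross-layer parameter sharing)."""

    def __init__(self, hidden_dim: int, num_heads: int, num_layers: int):
        super().__init__()
        self.num_layers = num_layers
        # One set of layer parameters, applied at every depth.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x

encoder = SharedEncoder(hidden_dim=768, num_heads=12, num_layers=12)
# The parameter count is that of a single layer, not 12 distinct layers.
print(sum(p.numel() for p in encoder.parameters()))
```

Note that while the parameter count stays at one layer's worth, the forward pass still runs the shared layer L times, so the computation per input is unchanged.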
Comparison of approximate parameter counts for embeddings and encoder layers in a standard BERT-Base model versus an ALBERT-Base model implementing factorization (E=128) and full cross-layer sharing. Note the significant reduction in both components for ALBERT.
Empirical results, particularly from the ALBERT paper, showed that cross-layer parameter sharing can yield models that are significantly smaller but perform comparably to, or sometimes even better than, their non-shared counterparts on various NLP benchmarks, especially when combined with factorized embeddings. It seems to act as a form of regularization, potentially improving generalization by preventing later layers from diverging too much in function from earlier ones. However, it can sometimes slow down training convergence compared to non-shared models, possibly because the shared parameters have to accommodate the functional requirements of different processing stages within the network depth.
Parameter sharing techniques offer compelling advantages: a substantially smaller memory footprint and model size, easier deployment, support for large vocabularies without a proportional growth in embedding parameters, and, as noted above, a regularization-like effect that can improve generalization.
However, there are potential drawbacks: training convergence can be slower, because a single set of weights must serve every depth of the network; inference cost is not reduced, since the shared layer is still executed L times; and forcing all layers to use identical weights can limit per-layer expressiveness, which may cost accuracy on some tasks.
Choosing the right parameter efficiency technique involves balancing these trade-offs. Factorized embeddings are often a relatively safe way to reduce parameters with minimal performance impact, especially for large vocabularies. Cross-layer parameter sharing is more aggressive but has proven surprisingly effective, as demonstrated by ALBERT. Evaluating these techniques empirically on the specific task and dataset remains an important step in model development. These methods contribute significantly to the ongoing effort to build powerful yet practical Transformer models.