The standard Transformer architecture, while effective, presents computational challenges, primarily the O(N²) complexity of self-attention with respect to the input sequence length N. This quadratic cost makes it impractical to apply Transformers to very long sequences.
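To make the quadratic cost concrete, the minimal sketch below (an illustrative NumPy implementation, not any particular library's API) computes naive scaled dot-product attention. The N × N score matrix it materializes is the source of the O(N²) time and memory cost.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive scaled dot-product attention (illustrative sketch)."""
    d_k = Q.shape[-1]
    # The score matrix has shape (N, N): this is where the O(N^2) cost arises.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Doubling the sequence length N quadruples the number of attention scores.
for N in (512, 1024, 2048):
    Q = K = V = np.random.randn(N, 64)
    _ = naive_attention(Q, K, V)
    print(f"N = {N}: {N * N:,} attention scores")
```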
This chapter examines these limitations and introduces several architectural modifications designed to enhance efficiency and performance. We will analyze the computational cost of vanilla self-attention and then investigate alternatives, including sparse attention, linear and kernel-based approximations (Performers), low-rank projections (Linformer), and segment-level recurrence (Transformer-XL), as laid out in the sections below.
By studying these variants, you will gain insight into the ongoing research and development aimed at making Transformer models more scalable and efficient for diverse applications.
6.1 Computational Complexity of Self-Attention
6.2 Sparse Attention Mechanisms
6.3 Approximating Attention: Linear Transformers
6.4 Kernel-Based Attention Approximation (Performers)
6.5 Low-Rank Projection Methods (Linformer)
6.6 Transformer-XL: Segment-Level Recurrence
6.7 Relative Positional Encodings
6.8 Pre-Normalization vs Post-Normalization (Pre-LN vs Post-LN)
6.9 Scaling Laws for Neural Language Models
6.10 Parameter Efficiency and Sharing Techniques