While scaling laws describe the relationship between model size, data, and performance, the specific design of the underlying architecture dictates precisely where computational resources are spent. The dominant architecture for modern LLMs is the Transformer. Understanding how its core components contribute to computational cost and memory usage is fundamental to devising effective optimization strategies. The choices made in the architecture's blueprint directly influence inference latency and hardware requirements.
Analyzing Transformer Components
The Transformer architecture, typically stacked in layers, relies on a few critical building blocks. Each block has distinct computational characteristics and presents unique opportunities and challenges for efficiency improvements.
Self-Attention Mechanism
The self-attention mechanism allows the model to weigh the importance of different tokens in the input sequence when processing a specific token. This capability is central to the LLM's understanding of context. However, it comes at a significant computational cost.
The standard scaled dot-product attention calculation uses three learned projection matrices ($W_Q$, $W_K$, $W_V$) that map the input embeddings to the query, key, and value matrices Q, K, and V. The attention output is computed as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let n be the sequence length and d be the model's hidden dimension (embedding size).
$QK^T$ Calculation: Computing the dot products between all query and key vectors multiplies a matrix of size $(n \times d_k)$ by a matrix of size $(d_k \times n)$, resulting in an $(n \times n)$ attention score matrix. This step has a computational complexity of approximately $O(n^2 d_k)$.
Attention Score Application: Multiplying the $(n \times n)$ attention score matrix by the value matrix $V$ (size $n \times d_v$) takes $O(n^2 d_v)$ operations.
Assuming $d_k = d_v = d/h$, where $h$ is the number of heads (discussed next), the overall complexity summed across all heads is dominated by the quadratic dependence on sequence length, approximately $O(n^2 d)$. This quadratic scaling makes processing long sequences computationally demanding and is a primary bottleneck.
Furthermore, the intermediate $(n \times n)$ attention score matrix requires $O(n^2)$ memory storage. For long sequences (e.g., $n > 4096$), this memory requirement can become substantial, often exceeding the cache capacity of processing units and leading to memory bandwidth limitations.
Basic flow of the scaled dot-product self-attention mechanism, highlighting the matrix multiplications.
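To make the two matrix multiplications and the $(n \times n)$ intermediate concrete, here is a minimal, unoptimized PyTorch sketch of single-head scaled dot-product attention (an illustration only; production kernels fuse and tile these steps). The loop at the end estimates the memory footprint of the score matrix alone at a few sequence lengths.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Naive single-head attention. q, k: (n, d_k); v: (n, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (n, n): ~O(n^2 * d_k) FLOPs
    weights = torch.softmax(scores, dim=-1)            # (n, n) attention weights
    return weights @ v                                  # (n, d_v): ~O(n^2 * d_v) FLOPs

# Memory footprint of the (n, n) score matrix alone, per head, in fp16.
for n in (1024, 4096, 16384):
    print(f"n={n:6d}: ~{n * n * 2 / 2**20:.0f} MiB for the attention scores")
```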
Multi-Head Attention (MHA)
Instead of performing a single attention calculation, Multi-Head Attention (MHA) linearly projects the Q, K, and V vectors h times with different learned projections. Attention is computed independently for each "head" in parallel, and the results are concatenated and projected again.
Parallelism: MHA allows the model to jointly attend to information from different representation subspaces at different positions. Computationally, it splits the d-dimensional space into h smaller subspaces (typically $d_k = d_v = d/h$), potentially improving hardware utilization through parallelism.
Cost: While the total computation is roughly the same as single-head attention with full dimension $d$ (the work is simply split across heads), MHA involves additional, smaller matrix multiplications for the per-head projections and the final output projection. Because each head's projections are correspondingly smaller ($d \times d/h$), the total projection parameter count stays close to that of a single full-width head, roughly $4d^2$ including the output projection. The $O(n^2 d)$ compute and $O(n^2)$ memory for scores remain the dominant factors.
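The following compact PyTorch sketch (simplified, with no masking, dropout, or biases; the class and argument names are illustrative) shows how the four $d \times d$ projections split the model dimension into $h$ subspaces and how the heads are concatenated and projected back.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Simplified multi-head attention: no masking, dropout, or biases."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads                     # d_k = d_v = d / h
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # final output projection

    def forward(self, x):                                    # x: (batch, n, d_model)
        b, n, d = x.shape

        def split(t):                                        # (b, n, d) -> (b, h, n, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, n, n)
        out = torch.softmax(scores, dim=-1) @ v                # (b, h, n, d_head)
        out = out.transpose(1, 2).reshape(b, n, d)             # concatenate heads
        return self.w_o(out)
```

Note that the per-head width shrinks as $h$ grows, so the total projection FLOPs stay roughly constant; the added cost is mostly the extra reshapes and the output projection.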
Feed-Forward Networks (FFN)
Following the attention mechanism in each Transformer block, there is a position-wise Feed-Forward Network (FFN). This typically consists of two linear transformations with a non-linear activation function in between (like ReLU, GeLU, or SwiGLU).
$$\text{FFN}(x) = \text{Activation}(xW_1 + b_1)W_2 + b_2$$
The intermediate ("expansion") layer dimension $d_{ff}$ is usually larger than the model dimension $d$, often $d_{ff} = 4d$.
Parameters: FFN layers constitute a large fraction of an LLM's total parameters. For instance, in a standard Transformer block, the FFN parameters ($d \times d_{ff}$ for the first layer and $d_{ff} \times d$ for the second) significantly outweigh the attention parameters (primarily $3 \times (d \times d)$ for the Q, K, V projections plus a $d \times d$ output projection). It's common for FFNs to account for roughly two-thirds of the total parameters in an LLM.
Computation: The computational cost of the FFN is approximately $O(n \cdot d \cdot d_{ff})$. While linear in sequence length $n$, the large dimensions $d$ and $d_{ff}$ make FFNs highly compute-intensive. These dense matrix multiplications are often compute-bound on modern hardware.
Illustrative distribution of parameters between attention projection layers and feed-forward network layers within a typical Transformer block. FFNs usually dominate.
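A back-of-the-envelope Python calculation (ignoring biases, embeddings, and normalization parameters; the function name is ours) reproduces the roughly two-thirds split for a standard block with $d_{ff} = 4d$:

```python
def block_param_split(d_model: int, d_ff: int) -> None:
    """Rough per-block parameter split for a standard Transformer block."""
    attn_params = 4 * d_model * d_model            # Q, K, V and output projections
    ffn_params = 2 * d_model * d_ff                # two FFN weight matrices
    total = attn_params + ffn_params
    print(f"attention: {attn_params / 1e6:.1f}M ({attn_params / total:.0%})  "
          f"FFN: {ffn_params / 1e6:.1f}M ({ffn_params / total:.0%})")

block_param_split(d_model=4096, d_ff=4 * 4096)
# attention: 67.1M (33%)  FFN: 134.2M (67%)
```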
Layer Normalization and Residual Connections
Layer Normalization (LayerNorm): Typically applied either before the attention and FFN modules (pre-norm) or after them (post-norm), LayerNorm normalizes the activations across the feature dimension. Its computation involves calculating a mean and variance, followed by scaling and shifting. While far cheaper than attention or FFNs ($O(nd)$), it involves reductions and element-wise operations that add to overall latency and require memory access.
Residual Connections: These skip connections ($x + \text{Sublayer}(x)$) are essential for training deep networks by allowing gradients to flow more easily. They involve simple element-wise addition, imposing minimal computational overhead but impacting the data flow and memory access patterns.
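To show where normalization and the residual additions sit in the data flow, here is a minimal pre-norm block sketch in PyTorch (the class name is illustrative; a post-norm variant would instead normalize after each addition):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm Transformer block: x + Attn(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # cheap O(n*d) reduction plus scale/shift
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = attn
        self.ffn = ffn

    def forward(self, x):
        x = x + self.attn(self.norm1(x))     # residual connection: element-wise add
        x = x + self.ffn(self.norm2(x))
        return x

# Example wiring with the earlier sketches (d_model=512, 8 heads, d_ff=2048):
# block = PreNormBlock(512, MultiHeadAttention(512, 8),
#                      nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)))
```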
Architectural Choices and Their Efficiency Consequences
The specific configuration of these components influences overall efficiency (a rough cost sketch combining these factors follows the list):
Hidden Dimension (d): Increasing d generally improves model capacity but quadratically increases the computation in FFNs ($O(d^2)$ per token) and significantly impacts attention computation ($O(n^2 d)$). Memory for parameters and activations also scales accordingly.
Number of Layers: More layers increase depth and representational power but linearly increase sequential computation time and total parameter count.
Number of Heads (h): Affects the granularity of attention and potential for parallelism.
FFN Expansion Factor: The ratio $d_{ff}/d$ directly impacts FFN parameter count and computational cost. Gated variants like SwiGLU use three weight matrices instead of two in the FFN, increasing parameters further unless $d_{ff}$ is reduced to compensate.
Sequence Length (n): The quadratic $O(n^2)$ complexity in attention makes handling long sequences a major performance challenge.
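To tie these factors together, the sketch below gives a very rough per-token parameter and FLOP estimate for a hypothetical decoder stack. It is an illustrative approximation only: it ignores embeddings, the vocabulary projection, normalization, KV-cache effects, and kernel-level details, and it uses the common "FLOPs ≈ 2 × parameters" rule for the weight multiplications.

```python
def rough_decoder_costs(d: int, n_layers: int, d_ff: int, n: int) -> dict:
    """Very rough per-token cost estimate at context position n (decode step).

    Assumes standard blocks: four d x d attention projections and a 2-matrix FFN.
    Weight matmuls cost ~2 FLOPs per parameter per token; the attention score
    terms (QK^T and scores @ V) additionally scale with the context length n.
    """
    attn_params = 4 * d * d
    ffn_params = 2 * d * d_ff
    params = n_layers * (attn_params + ffn_params)

    proj_flops = 2 * params              # all weight-matrix multiplies
    score_flops = n_layers * 4 * n * d   # 2*n*d for QK^T plus 2*n*d for scores @ V
    return {"params": params, "flops_per_token": proj_flops + score_flops}

# Same hypothetical model, two context lengths: the attention term grows with n.
for n in (2_048, 32_768):
    est = rough_decoder_costs(d=4096, n_layers=32, d_ff=4 * 4096, n=n)
    print(f"n={n:6d}: ~{est['params'] / 1e9:.2f}B params, "
          f"~{est['flops_per_token'] / 1e9:.1f} GFLOPs per token")
```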
Understanding these architectural bottlenecks is the first step toward targeted optimization. Techniques like quantization and pruning often focus heavily on the large FFN layers due to their parameter count. Methods aiming to improve latency for long sequences must address the O(n2) complexity of the attention mechanism, often through specialized kernels (like FlashAttention) or approximate attention algorithms. Memory-saving techniques are critical for managing the large activation tensors, especially the attention scores. The interplay between these components and their resource demands motivates the diverse set of compression and acceleration techniques we will examine throughout this course.