Having explored the fundamental mechanisms of self-attention and positional encoding, we now examine how these components are assembled into the complete Transformer architecture. The original Transformer model, proposed in the paper "Attention Is All You Need," follows an encoder-decoder structure, a common pattern in sequence-to-sequence tasks like machine translation or text summarization. Unlike recurrent models that process sequences step-by-step, the Transformer leverages attention mechanisms to process the entire input sequence simultaneously, capturing dependencies regardless of their distance.
The architecture comprises two main parts: a stack of encoders and a stack of decoders.
Figure: High-level view of the Transformer architecture, illustrating the flow from input tokens through the encoder and decoder stacks to output probabilities. Note the connection carrying the encoder's output to the decoder.
The role of the encoder is to process the entire input sequence and generate a sequence of continuous representations (contextualized embeddings) that encode information about the input. It consists of a stack of N identical layers (N=6 in the original paper). Each layer has two primary sub-layers:

1. A multi-head self-attention mechanism.
2. A position-wise fully connected feed-forward network.
Residual connections are employed around each of the two sub-layers, followed by layer normalization. This means the output of each sub-layer is LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself (e.g., multi-head attention or the feed-forward network). The self-attention mechanism allows each position in the encoder to attend to all positions in the previous layer's output, effectively capturing dependencies within the input sequence. The feed-forward network is applied independently to each position.
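To make this structure concrete, here is a minimal PyTorch sketch of one encoder layer, assuming the post-norm arrangement described above and the default dimensions from the original paper (d_model=512, 8 heads, d_ff=2048). The class and variable names are illustrative, not a reference implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward,
    each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x)        # Q, K, V all come from x
        x = self.norm1(x + self.dropout(attn_out))   # LayerNorm(x + Sublayer(x))
        # Sub-layer 2: position-wise feed-forward network, same residual pattern.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

x = torch.randn(2, 10, 512)         # (batch, seq_len, d_model)
print(EncoderLayer()(x).shape)      # torch.Size([2, 10, 512])
```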
The decoder's role is to generate the output sequence, typically one element at a time in an autoregressive manner. Like the encoder, it consists of a stack of N identical layers. Each decoder layer, however, has three primary sub-layers:

1. A masked multi-head self-attention mechanism over the decoder's own (partially generated) output sequence.
2. A multi-head cross-attention mechanism (encoder-decoder attention) over the encoder's output.
3. A position-wise fully connected feed-forward network.
Residual connections and layer normalization are applied around each of these sub-layers, just as in the encoder. The masked self-attention ensures that predictions for position i can depend only on the known outputs at positions less than i, preserving the autoregressive property needed for generation. The cross-attention mechanism is central to the sequence-to-sequence functionality: it allows each position in the decoder to attend to all positions in the input sequence (via the encoder's output representations).
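The masking is typically implemented by disallowing attention to future positions. The sketch below uses a small hypothetical helper to build such a causal mask in PyTorch; the True entries mark pairs (i, j) with j > i that attention must ignore.

```python
import torch

def causal_mask(seq_len):
    """Boolean mask where True marks positions that must NOT be attended to
    (i.e., future positions j > i for each query position i)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Passed as attn_mask to nn.MultiheadAttention, this blocks attention to future tokens.
```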
The communication between the encoder and decoder happens primarily through the cross-attention mechanism in each decoder layer. The entire encoder stack first processes the input sequence to produce a sequence of output vectors z=(z1,...,zn). These vectors z are then used as the source for the Keys (K) and Values (V) in the cross-attention sub-layer of every decoder layer. The Queries (Q) for this cross-attention layer come from the output of the preceding sub-layer (the masked self-attention layer) within the decoder. This allows the decoder, at each step of generating the output sequence, to focus on relevant parts of the input sequence encoded in z.
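The following PyTorch snippet sketches this wiring: the decoder's hidden states supply the queries, while the encoder output z supplies the keys and values. The tensor shapes here are arbitrary, chosen only for illustration.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(1, 10, d_model)  # z = (z1, ..., zn): n = 10 source positions
decoder_hidden = torch.randn(1, 4, d_model)   # output of the decoder's masked self-attention sub-layer

# Queries come from the decoder; Keys and Values come from the encoder output z.
out, attn_weights = cross_attn(query=decoder_hidden,
                               key=encoder_output,
                               value=encoder_output)
print(out.shape)           # torch.Size([1, 4, 512]) -- one vector per target position
print(attn_weights.shape)  # torch.Size([1, 4, 10])  -- each target position attends over all 10 source positions
```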
Before the input sequence enters the encoder stack, the input tokens are converted into vectors using an embedding layer, and positional encodings are added to these embeddings to inject information about the sequence order. Similarly, for the decoder, the target output tokens (shifted right during training, or the previously generated tokens during inference) are embedded and combined with positional encodings before entering the decoder stack.
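A compact sketch of this input preparation, using the sinusoidal positional encoding discussed earlier and the embedding scaling by sqrt(d_model) from the original paper; the vocabulary size and token ids below are arbitrary placeholders.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 10000, 512, 5000

embedding = nn.Embedding(vocab_size, d_model)

# Fixed sinusoidal positional encodings, precomputed for up to max_len positions.
position = torch.arange(max_len).unsqueeze(1)                       # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

token_ids = torch.tensor([[5, 42, 7, 99]])             # (batch=1, seq_len=4)
x = embedding(token_ids) * math.sqrt(d_model)          # scale embeddings as in the paper
x = x + pe[: token_ids.size(1)]                        # inject sequence-order information
print(x.shape)                                         # torch.Size([1, 4, 512])
```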
After the final decoder layer produces its output vectors, a final linear transformation followed by a softmax function is typically used to convert these vectors into probability distributions over the target vocabulary, allowing for the prediction of the next token in the output sequence.
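As a sketch, this final step amounts to a single linear projection followed by a softmax over the vocabulary; the dimensions and greedy selection below are illustrative only.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
to_vocab = nn.Linear(d_model, vocab_size)    # final linear transformation

decoder_output = torch.randn(1, 4, d_model)  # vectors from the last decoder layer
logits = to_vocab(decoder_output)            # (1, 4, vocab_size)
probs = torch.softmax(logits, dim=-1)        # probability distribution over the target vocabulary
next_token = probs[0, -1].argmax()           # greedy choice for the next output token
```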
This overall structure, built upon stacked layers containing attention mechanisms, residual connections, and layer normalization, forms the basis of the Transformer model. Subsequent sections will dissect each component, such as the specific structure of encoder and decoder layers, masked attention, cross-attention, feed-forward networks, and normalization, in greater detail.