Building upon our understanding of attention mechanisms from the previous chapter, we can now visualize the complete Transformer model as presented in the seminal paper "Attention Is All You Need". At its heart, the Transformer employs an encoder-decoder structure, a common pattern in sequence-to-sequence tasks like machine translation or text summarization, but unlike earlier sequence-to-sequence models it relies on neither recurrence nor convolution.
Instead of processing input tokens one after another as an RNN does, the Transformer processes the entire input sequence simultaneously. This parallel processing is a significant departure from recurrent models and relies heavily on the self-attention mechanisms we've discussed.
The overall architecture consists of two main parts: an encoder stack, which maps the input sequence into a sequence of continuous representations, and a decoder stack, which uses those representations to generate the output sequence one token at a time.
Let's visualize this high-level structure:
A high-level view of the Transformer architecture, showing the input processing, encoder stack, output processing, decoder stack, and final output layers. Note the connection carrying the encoder's output to each layer in the decoder stack.
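To make this structure concrete, here is a minimal sketch using PyTorch's built-in encoder and decoder modules. The model dimension, head count, and layer count follow the paper's base configuration, but the tensor shapes, variable names, and the choice of nn.TransformerEncoder/nn.TransformerDecoder are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes matching the paper's base model (d_model=512, 8 heads, 6 layers);
# everything else (shapes, names) is our own choice for this sketch.
d_model, n_heads, n_layers = 512, 8, 6

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Stand-ins for already embedded and position-encoded sequences
# (the embedding step itself is sketched after the next paragraph).
src = torch.randn(2, 12, d_model)   # (batch, source length, d_model)
tgt = torch.randn(2, 9, d_model)    # (batch, target length, d_model)

# Causal mask for the decoder's masked self-attention: -inf above the diagonal
# blocks attention to future target positions.
tgt_mask = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)

memory = encoder(src)                          # output of the final encoder layer
out = decoder(tgt, memory, tgt_mask=tgt_mask)  # every decoder layer attends to `memory`
print(out.shape)                               # torch.Size([2, 9, 512])
```

Note how `memory`, the encoder's final output, is passed into every call of the decoder: this is the connection highlighted in the figure above.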
The inputs (e.g., source language sentence tokens) first pass through an embedding layer and then have positional encoding added to them. This combined representation feeds into the bottom of the encoder stack. The output of the final encoder layer serves as the key (K) and value (V) inputs for the encoder-decoder attention mechanism within each decoder layer.
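As a rough sketch of this input processing, the snippet below (again assuming PyTorch) embeds a batch of token ids, scales the embeddings by the square root of the model dimension as the paper does, and adds sinusoidal positional encodings. The vocabulary size and batch shapes are made up for illustration, and positional encoding itself is covered in detail later in the chapter.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Sine on even dimensions, cosine on odd dimensions, with frequencies
    # that decrease geometrically across the embedding dimensions.
    position = torch.arange(seq_len).unsqueeze(1)                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

d_model, src_vocab = 512, 10000                  # illustrative sizes
embed = nn.Embedding(src_vocab, d_model)

tokens = torch.randint(0, src_vocab, (2, 12))    # (batch, source length) of token ids
x = embed(tokens) * math.sqrt(d_model)           # scale embeddings by sqrt(d_model), as in the paper
x = x + sinusoidal_positional_encoding(tokens.size(1), d_model)  # broadcasts over the batch
# `x` feeds into the bottom of the encoder stack; the stack's output (the `memory`
# in the earlier sketch) supplies K and V to each decoder layer's encoder-decoder attention.
```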
Simultaneously, the outputs (e.g., target language sentence tokens, shifted right during training) are also embedded, combined with positional encoding, and fed into the bottom of the decoder stack. The decoder uses masked self-attention, so each position can attend only to earlier positions in the output sequence, and encoder-decoder attention to consult the encoded input representation. Finally, the output from the top decoder layer passes through a linear transformation and a softmax function to produce probability distributions over the possible next tokens in the output vocabulary.
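The output side can be sketched as a single linear layer followed by a softmax. In the snippet below (still PyTorch), the decoder output is a random placeholder tensor, `tgt_vocab` is an assumed vocabulary size, and the greedy argmax is just one simple way to pick the next token.

```python
import torch
import torch.nn as nn

d_model, tgt_vocab = 512, 10000                 # illustrative sizes

decoder_output = torch.randn(2, 9, d_model)     # placeholder for the top decoder layer's output

to_vocab = nn.Linear(d_model, tgt_vocab)        # final linear transformation
logits = to_vocab(decoder_output)               # (batch, target length, tgt_vocab)
probs = logits.softmax(dim=-1)                  # distribution over the vocabulary at each position

next_token = probs[:, -1, :].argmax(dim=-1)     # greedy choice for the next token, shape (batch,)
```

During training, the distributions at all positions are compared against the shifted target sequence at once; at inference time, only the last position's distribution is used to choose the next token to append.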
The subsequent sections of this chapter break down each of these components: positional encoding, the detailed structure of the encoder and decoder layers (with their Add & Norm steps and feed-forward networks), and the final output generation process. This layered approach allows us to understand how the Transformer captures complex dependencies within and between sequences.