The encoder's primary responsibility within the Transformer architecture is to process the input sequence and generate a sequence of contextual representations, often called hidden states or context vectors. Think of it as reading the input sentence and building a rich understanding of each word in relation to all other words in that sentence. This processed sequence captures not just the meaning of individual words but also how they interact within the context of the entire input.
This encoding process doesn't happen in a single step. Instead, the Transformer employs a stack of identical encoder layers. The original "Attention Is All You Need" paper used N=6 layers, but this number can vary depending on the specific model implementation and task. The output from one layer becomes the input for the next layer up the stack. This stacking allows the model to learn progressively more complex representations and relationships within the input data.
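As a rough sketch of this stacking (using PyTorch here; `EncoderLayer` is a placeholder for the single-layer module described below, not code from the original paper), the stack simply feeds each layer's output into the next:

```python
import copy
import torch.nn as nn

class EncoderStack(nn.Module):
    """Illustrative sketch: a stack of N identical encoder layers.
    `encoder_layer` is assumed to be a module implementing one layer
    as described in this section (self-attention + FFN + Add & Norm)."""
    def __init__(self, encoder_layer, num_layers=6):
        super().__init__()
        # Each layer shares the same structure but has its own learned weights.
        self.layers = nn.ModuleList(
            [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
        )

    def forward(self, x):
        # The output of one layer becomes the input of the next.
        for layer in self.layers:
            x = layer(x)
        return x
```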
Let's examine the internal structure of a single encoder layer. Each encoder layer consists of two main sub-layers:
Multi-Head Self-Attention Mechanism: This is the first sub-layer. As discussed in the previous chapter, self-attention allows the model to weigh the importance of different words in the input sequence when encoding a specific word. Multi-head attention enhances this by performing the self-attention process multiple times in parallel with different learned linear projections (different sets of Q, K, V weight matrices). This allows each head to potentially focus on different types of relationships or different representation subspaces (e.g., one head might focus on syntactic relationships, another on semantic similarity). The outputs of the heads are concatenated and linearly projected to form the final output of this sub-layer.
Position-wise Feed-Forward Network (FFN): This is the second sub-layer. It's a relatively simple fully connected feed-forward network, typically consisting of two linear transformations with a ReLU (Rectified Linear Unit) activation function in between. The formula is often expressed as:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

Here, x is the output from the previous sub-layer for a specific position, and W_1, b_1, W_2, and b_2 are learned parameters (weight matrices and bias vectors). A significant detail is that while the same FFN (same W_1, b_1, W_2, b_2) is used across all positions within a given layer, it is applied independently to each position's vector representation. This means the transformation for the vector representing "word A" doesn't directly depend on the transformation happening for "word B" within this FFN sub-layer itself (though context was already incorporated by the self-attention layer). This network increases the model's capacity by introducing non-linearity and allowing for more complex transformations of the attended features. The intermediate layer often expands the dimensionality (e.g., from d_model = 512 to d_ff = 2048) before projecting it back to d_model.
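A minimal sketch of this sub-layer (PyTorch; the class name and default dimensions are illustrative, chosen to match the values quoted above) might look like:

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied independently
    to each position's vector."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W_1, b_1: expand to d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # W_2, b_2: project back to d_model

    def forward(self, x):
        # x has shape (batch, seq_len, d_model); nn.Linear acts on the last
        # dimension, so the same weights transform each position separately.
        return self.linear2(torch.relu(self.linear1(x)))
```

Because the linear layers operate only on the last dimension, sharing weights across positions while transforming each position independently falls out of the implementation naturally.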
Add & Norm
Crucially, after each of these two sub-layers (Multi-Head Self-Attention and Position-wise FFN), the encoder employs two additional operations: a residual connection followed by layer normalization. This combination is often referred to as "Add & Norm".
So, the computation within one encoder layer for an input x looks like this:

AttentionOutput = LayerNorm(x + MultiHeadAttention(x))
LayerOutput = LayerNorm(AttentionOutput + FFN(AttentionOutput))
The LayerOutput then becomes the input for the next encoder layer in the stack.
Flow diagram illustrating the components and connections within a single Transformer encoder layer. Note the residual connections (dashed lines indicating the input 'x' being added before normalization) feeding into the 'Add & Norm' blocks.
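To make the wiring concrete, here is a minimal post-norm encoder layer sketch (PyTorch; `nn.MultiheadAttention` stands in for the multi-head self-attention sub-layer, and the inline `nn.Sequential` mirrors the FFN shown earlier; class names and defaults are assumptions for illustration, not code from the paper):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention and a position-wise FFN,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        # nn.MultiheadAttention handles the per-head Q/K/V projections,
        # the parallel attention heads, concatenation, and output projection.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)   # queries, keys, values all come from x
        x = self.norm1(x + attn_out)            # Add & Norm around self-attention
        x = self.norm2(x + self.ffn(x))         # Add & Norm around the FFN
        return x
```

A layer like this could then be cloned N times by the stacking sketch shown earlier.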
The final output of the entire encoder stack (after passing through all N layers) is a sequence of vectors, one for each input token. These vectors are rich in contextual information derived from the entire input sequence via the stacked self-attention and feed-forward operations. This sequence of context vectors is then typically passed to each layer of the decoder stack, forming the basis for generating the output sequence.