Within each encoder and decoder layer, after the attention sub-layer has processed relationships across the sequence, another essential component takes over: the Position-wise Feed-Forward Network (FFN). While the self-attention mechanism excels at integrating information from different sequence positions, the FFN provides additional computational depth and non-linearity, operating on each position independently.
Think of the attention sub-layer as the communication hub, allowing tokens to exchange information. The FFN then acts as a processing unit for each token individually, transforming the information received through attention.
The FFN is a simple fully connected feed-forward network consisting of two linear transformations with a non-linear activation function in between. In the original Transformer paper, that activation is the Rectified Linear Unit (ReLU).
The input to the FFN at a specific position $t$ is the output vector $z_t$ from the preceding sub-layer (either self-attention or cross-attention, after the Add & Norm step, depending on the specific layer configuration such as Post-LN or Pre-LN). The computation proceeds in two steps. First, a linear transformation projects $z_t$ from dimension $d_{\text{model}}$ up to an inner dimension $d_{ff}$ and applies the ReLU activation:

$$h_t = \text{ReLU}(z_t W_1 + b_1) = \max(0, z_t W_1 + b_1)$$

Second, another linear transformation projects the result back down to $d_{\text{model}}$:

$$\text{FFN}(z_t) = h_t W_2 + b_2$$
Combining these steps, the complete FFN operation for a single position $t$ can be expressed as:
$$\text{FFN}(z_t) = \text{ReLU}(z_t W_1 + b_1) W_2 + b_2$$

Or, more generally, using $f$ for the activation function:
$$\text{FFN}(z_t) = f(z_t W_1 + b_1) W_2 + b_2$$

A critical aspect of this network is its "position-wise" application. While the same FFN (meaning the exact same weight matrices $W_1, W_2$ and biases $b_1, b_2$) is used across all positions in the sequence, it is applied independently to the vector representation at each position.
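To ground the formula, here is a minimal sketch of this sub-layer in PyTorch. The class name PositionWiseFFN and its constructor arguments are illustrative choices rather than a standard API; note that nn.Linear operates on the last dimension of its input, so the same weights are applied at every position automatically.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear transformations with a ReLU in between, applied identically at every position."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # z_t W_1 + b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # (...) W_2 + b_2
        self.activation = nn.ReLU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z has shape (batch, seq_len, d_model); nn.Linear acts on the last
        # dimension, so the same weights transform every position independently.
        return self.linear2(self.activation(self.linear1(z)))
```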
If the input sequence representation after the attention sub-layer is $Z = [z_1, z_2, \dots, z_n]$, where $n$ is the sequence length, the FFN computes:
$$\text{FFN}_{\text{output}} = [\text{FFN}(z_1), \text{FFN}(z_2), \dots, \text{FFN}(z_n)]$$

This contrasts sharply with the attention mechanism, which explicitly models interactions between different positions. The FFN processes each position's representation in isolation, allowing the model to learn complex transformations of individual token representations, informed by the context gathered via attention.
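A small numerical check makes this independence concrete. The snippet below, which reuses the PositionWiseFFN sketch from above and uses illustrative shapes, confirms that transforming the whole sequence at once matches transforming each position separately:

```python
import torch

torch.manual_seed(0)
ffn = PositionWiseFFN(d_model=512, d_ff=2048)
Z = torch.randn(1, 10, 512)  # (batch, seq_len, d_model)

full = ffn(Z)  # process the whole sequence in one call
per_position = torch.stack(
    [ffn(Z[:, t, :]) for t in range(Z.size(1))], dim=1
)  # process one position at a time

print(torch.allclose(full, per_position, atol=1e-6))  # True
```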
The diagram illustrates how the same two linear layers (with shared weights $W_1, b_1$ and $W_2, b_2$) and ReLU activation are applied independently to the input vector $z_t$ at each sequence position.
In the original Transformer paper "Attention Is All You Need", the model dimension $d_{\text{model}}$ was 512, and the inner-layer dimension of the FFN, $d_{ff}$, was set to 2048. This fourfold expansion ($d_{ff} = 4 \times d_{\text{model}}$) is a common heuristic, although variations exist. The expansion allows the FFN to project the representation into a higher-dimensional space where complex patterns might be more easily learned, before projecting it back to the standard model dimension.
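As a quick worked example, the parameter count of a single FFN sub-layer with these dimensions follows directly from the shapes of $W_1$, $b_1$, $W_2$, and $b_2$:

```python
# Parameter count of one FFN sub-layer with the dimensions from the paper.
d_model, d_ff = 512, 2048

params_layer1 = d_model * d_ff + d_ff      # W_1 (512 x 2048) plus b_1 (2048)
params_layer2 = d_ff * d_model + d_model   # W_2 (2048 x 512) plus b_2 (512)
total = params_layer1 + params_layer2

print(f"{total:,}")  # 2,099,712 parameters
```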
From an implementation perspective, applying the same linear transformations independently at each position is computationally efficient. It can be implemented using 1×1 convolutions. If you view the sequence of vectors $Z$ (shape: sequence length × $d_{\text{model}}$) as an "image" with height 1 and width equal to the sequence length, then the linear transformations of the FFN correspond to convolutions with kernel size 1×1, input channels $d_{\text{model}}$, and output channels $d_{ff}$ (for the first layer) or $d_{\text{model}}$ (for the second layer). This formulation allows deep learning frameworks to leverage highly optimized convolution implementations for parallel processing across the sequence length.
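To illustrate this equivalence concretely, the sketch below copies a linear layer's weights into a PyTorch nn.Conv1d with kernel size 1 (the 1D counterpart of a 1×1 convolution over a height-1 "image") and checks that both produce the same output. The transposes are needed only because Conv1d expects a channels-first layout; all names and shapes here are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, seq_len = 512, 2048, 10

linear = nn.Linear(d_model, d_ff)
conv = nn.Conv1d(d_model, d_ff, kernel_size=1)

# Load the linear layer's weights into the kernel-size-1 convolution so that
# both modules compute the same position-wise transformation.
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))  # (d_ff, d_model, 1)
    conv.bias.copy_(linear.bias)

Z = torch.randn(1, seq_len, d_model)                 # (batch, seq_len, d_model)
out_linear = linear(Z)                               # (batch, seq_len, d_ff)
out_conv = conv(Z.transpose(1, 2)).transpose(1, 2)   # Conv1d wants (batch, channels, seq_len)

print(torch.allclose(out_linear, out_conv, atol=1e-5))  # True
```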
The FFN sub-layer, along with the attention sub-layer, forms the core computational block within each encoder and decoder layer. Understanding its structure and position-wise operation is necessary for grasping how Transformers process information at both the sequence-interaction level (attention) and the individual token level (FFN). This sub-layer is followed by the residual connection and layer normalization, which we will discuss next.