Within each encoder and decoder layer, after the attention mechanism (either self-attention or encoder-decoder attention) has done its work of aggregating information across the sequence, the resulting vectors pass through another important component: the Position-wise Feed-Forward Network (FFN). This network adds further processing capacity to the model.
Despite its name, it's actually a very simple network. It consists of two linear transformations with a Rectified Linear Unit (ReLU) activation function in between. The "position-wise" part is significant: this same feed-forward network is applied independently and identically to each position (i.e., each token's representation) in the sequence.
So, if your input sequence representation after the attention sub-layer has dimensions $(\text{sequence\_length}, d_{\text{model}})$, the FFN processes each of the $\text{sequence\_length}$ vectors of size $d_{\text{model}}$ separately, using the same set of weights. It doesn't look across different positions at this stage; that interaction happened in the attention layer.
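To make the position-wise behavior concrete, here is a small sketch (using PyTorch; the sequence length and tensor values are illustrative choices, not details from the text) showing that applying a single linear layer to the whole $(\text{sequence\_length}, d_{\text{model}})$ tensor gives the same result as applying that layer to each position's vector on its own, because the same weights are reused at every position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, d_model = 6, 512              # illustrative sizes
x = torch.randn(seq_len, d_model)      # output of the attention sub-layer

linear = nn.Linear(d_model, d_model)   # one shared linear transformation

# Applying the layer to the whole sequence at once...
all_positions = linear(x)

# ...matches applying it to each position's vector separately,
# because the same weights are reused at every position.
per_position = torch.stack([linear(x[i]) for i in range(seq_len)])

print(torch.allclose(all_positions, per_position))  # expected: True
```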
The transformation applied at each position $x$ can be described mathematically as:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Where:

- $x$ is the vector representation at a single position (a vector of size $d_{\text{model}}$).
- $W_1$ and $b_1$ are the weight matrix and bias of the first linear transformation.
- $W_2$ and $b_2$ are the weight matrix and bias of the second linear transformation.
- $\text{ReLU}(z) = \max(0, z)$ is the element-wise activation applied between the two transformations.
The dimensionality typically changes within the FFN. The input and output dimension is $d_{\text{model}}$ (the model's embedding dimension), but the inner layer dimension, often denoted $d_{\text{ff}}$, is usually larger. A common configuration, as used in the original Transformer paper, is $d_{\text{ff}} = 4 \times d_{\text{model}}$. For example, if $d_{\text{model}} = 512$, then $d_{\text{ff}} = 2048$.
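As a rough sketch of how the formula and dimensions above map onto code, the following minimal PyTorch module implements the two linear transformations with a ReLU in between, using $d_{\text{model}} = 512$ and $d_{\text{ff}} = 2048$. The class name `PositionwiseFeedForward` and the omission of dropout are choices made here for clarity, not details taken from the text.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied identically at every position."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model); nn.Linear acts on the last
        # dimension, so each position is transformed independently with the
        # same weights.
        return self.w2(torch.relu(self.w1(x)))

ffn = PositionwiseFeedForward()
out = ffn(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

In a full Transformer block, this output would then pass through the residual connection and layer normalization (the Add & Norm step) before moving on.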
A conceptual view of the Position-wise Feed-Forward Network applied to a single position's vector representation.
You might wonder why this component is necessary after the attention mechanism. While attention handles the sequence-level interactions and context aggregation, the FFN provides additional computational depth and non-linearity at each position independently.
Although simple, the Position-wise Feed-Forward Network is an integral part of each Transformer block, working in tandem with the attention mechanism and the Add & Norm layers to process sequence information effectively. It contributes significantly to the overall performance of the Transformer architecture.