Position-wise Feed-Forward Networks (FFNs) are important components within each encoder and decoder layer. These networks process each position's vector independently, adding further representational capacity to the model.
Despite its name, the FFN is a very simple network: two linear transformations with a Rectified Linear Unit (ReLU) activation function in between. The "position-wise" part is significant: the same feed-forward network is applied independently and identically to each position (i.e., each token's representation) in the sequence.
So, if your input representation after the attention sub-layer has dimensions (sequence length, dmodel), the FFN processes each of the sequence-length vectors of size dmodel separately, using the same set of weights for every position. It doesn't look across different positions at this stage; that interaction happened in the attention layer.
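To make this concrete, here is a short sketch (using PyTorch, which the text does not prescribe, with illustrative dimensions) showing that a single linear layer applied to a (sequence length, dmodel) tensor acts on each position's vector independently with the same weights:

```python
import torch
import torch.nn as nn

d_model = 512          # model (embedding) dimension, as in the original Transformer
seq_len = 10           # arbitrary sequence length for illustration

linear = nn.Linear(d_model, d_model)   # one shared weight matrix
x = torch.randn(seq_len, d_model)      # one vector per position

# Applying the layer to the whole sequence at once...
out_batched = linear(x)

# ...matches applying it to each position's vector separately.
out_per_pos = torch.stack([linear(x[i]) for i in range(seq_len)])

print(torch.allclose(out_batched, out_per_pos, atol=1e-6))  # True
```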
Structure of the FFN
The transformation applied at each position x can be described mathematically as:
FFN(x) = ReLU(xW1 + b1)W2 + b2
Where:
- x is the output from the preceding layer (the Add & Norm layer following the attention mechanism) for a specific position.
- W1 and b1 are the weight matrix and bias term for the first linear transformation.
- W2 and b2 are the weight matrix and bias term for the second linear transformation.
- ReLU is the Rectified Linear Unit activation function, defined as ReLU(z)=max(0,z).
The dimensionality typically changes within the FFN. The input and output dimension is dmodel (the model's embedding dimension), but the inner layer dimension, often denoted as dff, is usually larger. A common configuration, as used in the original Transformer paper, is dff=4×dmodel. For example, if dmodel=512, then dff=2048.
- The first linear layer projects the dmodel-dimensional input vector x into a higher-dimensional space (dff).
- The ReLU activation introduces non-linearity, allowing the model to learn more complex functions. Without non-linearities like ReLU, stacking multiple linear layers would add no modeling power beyond that of a single linear transformation, since a composition of linear maps is itself linear.
- The second linear layer projects the result back down to the original dmodel dimension, ready for the next layer or block.
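Putting the three steps above together, a minimal sketch of the position-wise FFN might look like the following (a PyTorch illustration assuming the paper's dmodel=512 and dff=2048; the class name is ours, not from the original text):

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to each position independently."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.relu = nn.ReLU()                     # element-wise non-linearity
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, d_model); the same weights act on every position
        return self.linear2(self.relu(self.linear1(x)))


# Usage: the output keeps the (batch_size, seq_len, d_model) shape
ffn = PositionWiseFeedForward()
x = torch.randn(2, 10, 512)
print(ffn(x).shape)  # torch.Size([2, 10, 512])
```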
A view of the Position-wise Feed-Forward Network applied to a single position's vector representation.
Why Use It?
You might wonder why this component is necessary after the attention mechanism. While attention handles the sequence-level interactions and context aggregation, the FFN provides additional computational depth and non-linearity at each position independently.
- Increased Model Capacity: It adds learnable parameters, increasing the model's ability to represent complex patterns within the features of each token.
- Non-linearity: As mentioned, the ReLU activation is essential for the model's ability to learn non-linear relationships. The attention sub-layer is built largely from linear operations (the query, key, value, and output projections and the weighted sum), with the softmax over attention weights as its main non-linearity, so the FFN supplies much of the block's non-linear processing (see the short check after this list).
- Feature Transformation: It can be thought of as transforming the features learned by the attention mechanism into a more suitable representation for the next layer or block. The expansion to dff and contraction back to dmodel allows the network to potentially learn richer combinations of features in the higher-dimensional space before projecting them back.
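As a quick check of the non-linearity point above, the following illustrative snippet (again PyTorch, not from the original text) shows that two stacked linear layers with no ReLU in between collapse into a single equivalent linear transformation:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
lin1 = nn.Linear(d_model, d_ff, bias=False)
lin2 = nn.Linear(d_ff, d_model, bias=False)

x = torch.randn(4, d_model)

# Two linear layers with no activation in between...
stacked = lin2(lin1(x))

# ...equal one linear layer whose weight is the product of the two.
combined_weight = lin2.weight @ lin1.weight        # (d_model, d_ff) @ (d_ff, d_model)
single = x @ combined_weight.T

print(torch.allclose(stacked, single, atol=1e-4))  # True: no extra modeling power
```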
Although simple, the Position-wise Feed-Forward Network is an integral part of each Transformer block, working in tandem with the attention mechanism and the Add & Norm layers to process sequence information effectively. It contributes significantly to the overall performance of the Transformer architecture.