After the final decoder layer has processed the sequence, incorporating information from the input sequence (via cross-attention) and the previously generated output tokens (via masked self-attention), we are left with a sequence of high-dimensional representation vectors. For each position in the output sequence, the decoder stack produces a vector of dimension $d_{\text{model}}$. While this vector encodes rich contextual information, it isn't directly interpretable as a probability distribution over the target vocabulary, which is necessary for tasks like translation or text generation.
The final step in the Transformer's decoder architecture involves converting these output vectors into usable probabilities. This is typically achieved through two sequential operations: a final linear transformation followed by a softmax activation function.
The output from the top decoder layer is a tensor of shape (batch_size, target_sequence_length, $d_{\text{model}}$). The purpose of the final linear layer is to project this $d_{\text{model}}$-dimensional representation vector for each position onto a vector whose dimension equals the size of the target vocabulary; let's denote this size as $V$.
This is a standard fully connected linear layer without an activation function (or sometimes viewed as having an identity activation). If we represent the output of the final decoder layer as $H_{\text{dec}} \in \mathbb{R}^{\text{batch\_size} \times \text{target\_sequence\_length} \times d_{\text{model}}}$, the weight matrix of the linear layer as $W_{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times V}$, and its bias as $b_{\text{out}} \in \mathbb{R}^{V}$, the operation can be described as:
$$\text{Logits} = H_{\text{dec}} W_{\text{out}} + b_{\text{out}}$$

Here, the matrix multiplication is applied independently to the $d_{\text{model}}$-dimensional vector at each position along the target_sequence_length dimension. The resulting $\text{Logits}$ tensor has a shape of (batch_size, target_sequence_length, $V$). Each vector of size $V$ at a specific position $t$ in the sequence contains raw, unnormalized scores (logits) for every possible token in the target vocabulary. A higher score suggests a higher likelihood for that token at that position.
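As a rough PyTorch sketch of this projection (the batch size, sequence length, $d_{\text{model}}$, and vocabulary size below are illustrative choices, not values from the text):

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumed for this sketch).
batch_size, target_sequence_length = 2, 7
d_model, vocab_size = 512, 32000

# Stand-in for the output of the top decoder layer:
# shape (batch_size, target_sequence_length, d_model).
h_dec = torch.randn(batch_size, target_sequence_length, d_model)

# Final linear layer: projects d_model -> V, applied independently
# to the vector at every position in the sequence.
output_projection = nn.Linear(d_model, vocab_size)

logits = output_projection(h_dec)
print(logits.shape)  # torch.Size([2, 7, 32000])
```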
Implementation Note: A common technique, particularly in models where the source and target vocabularies are shared (or closely related), is to share the weights between the input embedding layer and this final linear layer. The input embedding layer projects from vocabulary indices (effectively one-hot vectors) to $d_{\text{model}}$, while the final linear layer projects from $d_{\text{model}}$ back to vocabulary scores. Sharing the weight matrix $W_{\text{out}}$ with the input embedding matrix (possibly transposed) significantly reduces the number of model parameters, especially for large vocabularies, and has been shown to work well empirically.
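A minimal sketch of this weight-tying idea in PyTorch (the class and attribute names are illustrative): because `nn.Linear` stores its weight as a (V, $d_{\text{model}}$) matrix, it already matches the embedding matrix's shape and can be tied directly without an explicit transpose.

```python
import torch.nn as nn

class TiedOutputHead(nn.Module):
    """Illustrative module tying the input embedding and the final linear layer."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)                   # weight: (V, d_model)
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)  # weight: (V, d_model)
        # Share the same parameter tensor between the two layers.
        self.output_projection.weight = self.embedding.weight

    def forward(self, h_dec):
        # h_dec: (batch, seq_len, d_model) -> logits: (batch, seq_len, V)
        return self.output_projection(h_dec)
```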
The logits produced by the linear layer are real-valued scores and don't form a probability distribution. To convert these scores into probabilities, the softmax function is applied independently to the logit vector at each position in the target sequence.
For a specific position $t$ in the sequence, let $z_t \in \mathbb{R}^{V}$ be the logit vector. The softmax function computes the probability $p_i$ for the $i$-th token in the vocabulary (where $i$ ranges from $1$ to $V$) as follows:
$$p_i = \text{Softmax}(z_t)_i = \frac{e^{z_{t,i}}}{\sum_{j=1}^{V} e^{z_{t,j}}}$$

This operation yields a probability vector $p_t \in \mathbb{R}^{V}$ for each position $t$, where each component satisfies $0 \le p_i \le 1$ and the components sum to one, $\sum_{i=1}^{V} p_i = 1$.
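Continuing the sketch from above, the softmax is applied over the last (vocabulary) dimension of the logits tensor; the tensor values and shapes here are again illustrative:

```python
import torch

# Stand-in logits of shape (batch_size, target_sequence_length, V).
logits = torch.randn(2, 7, 32000)

# Softmax over the vocabulary dimension, applied independently
# at every position of every sequence in the batch.
probs = torch.softmax(logits, dim=-1)

print(probs.shape)        # torch.Size([2, 7, 32000])
print(probs.sum(dim=-1))  # every entry is 1.0 (up to floating-point error)
```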
The output of the entire Transformer decoder stack is therefore a tensor of shape (batch_size, target_sequence_length, $V$), where each vector along the last dimension represents a probability distribution over the target vocabulary.
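As a simple illustration of how these probabilities can be turned into concrete token predictions, a greedy choice takes the highest-probability vocabulary entry at each position (this is only one possible way to consume the output, shown here as a sketch):

```python
import torch

# Probability tensor of shape (batch_size, target_sequence_length, V),
# standing in for the decoder's final output.
probs = torch.softmax(torch.randn(2, 7, 32000), dim=-1)

# Greedy choice: the index of the most probable vocabulary entry per position.
predicted_token_ids = probs.argmax(dim=-1)
print(predicted_token_ids.shape)  # torch.Size([2, 7])
```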
This final linear layer and softmax function bridge the gap between the complex internal representations learned by the Transformer and the discrete, probabilistic nature of language generation tasks. They provide the necessary mechanism to translate the model's understanding into concrete predictions over the vocabulary.