After the final decoder layer processes the sequence information, incorporating masked self-attention and encoder-decoder attention, we are left with a sequence of output vectors. Each vector at position t in the output sequence represents the decoder's understanding of what the token at that position should be, given the input sequence and the previously generated output tokens up to position t−1.
However, these output vectors are typically high-dimensional and contain continuous values. They don't directly tell us which specific word from our vocabulary is the most likely choice for that position. To get from these internal representations to actual predicted words, we need two final steps: a Linear transformation and a Softmax activation function.
The first step is to project the high-dimensional output vector produced by the decoder stack into a vector whose size matches our target vocabulary. For example, if our model needs to predict words from a vocabulary of 50,000 unique tokens, this linear layer will transform the decoder's output vector (say, of dimension $d_{model} = 512$) into a vector of dimension 50,000.
This is achieved using a standard fully connected linear layer, sometimes called a projection layer. Mathematically, if $x_{out} \in \mathbb{R}^{d_{model}}$ is the output vector from the top decoder layer for a specific position, and the vocabulary size is $V$, the linear layer performs:
$$\text{logits} = W_{proj} \, x_{out} + b_{proj}$$

Here, $W_{proj}$ is a weight matrix of shape $(V, d_{model})$ and $b_{proj}$ is a bias vector of shape $(V, 1)$. Both $W_{proj}$ and $b_{proj}$ are learned parameters, adjusted during the training process. The resulting vector, often called the logits vector, has dimension $V$. Each element of the logits vector corresponds to a score for one specific token in the vocabulary. A higher score suggests that the corresponding token is more likely, according to the model at this stage.
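To make the shapes concrete, here is a minimal NumPy sketch of this projection, assuming the $d_{model} = 512$ and $V = 50{,}000$ values from the example above; the names W_proj, b_proj, and x_out are illustrative rather than taken from any particular library.

```python
import numpy as np

d_model = 512    # dimension of the decoder's output vectors
V = 50_000       # target vocabulary size

# Learned parameters of the projection layer (randomly initialized here
# purely for illustration; in a real model they are trained).
W_proj = np.random.randn(V, d_model) * 0.02   # shape (V, d_model)
b_proj = np.zeros(V)                          # bias, one entry per token

# Output vector from the top decoder layer at one position.
x_out = np.random.randn(d_model)              # shape (d_model,)

# Linear projection: one score (logit) per vocabulary token.
logits = W_proj @ x_out + b_proj              # shape (V,)
print(logits.shape)                           # (50000,)
```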
The logits vector provides scores, but they aren't probabilities. The scores can be positive or negative, and they don't sum to 1. To convert these scores into a proper probability distribution over the vocabulary, we apply the Softmax function.
The Softmax function takes the logits vector as input and outputs a probability vector of the same dimension $V$. For a logits vector $z = [z_1, z_2, ..., z_V]$, the Softmax function calculates the probability $P_i$ for the $i$-th token in the vocabulary as:

$$P_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$
This function has two desirable properties: every output value $P_i$ lies between 0 and 1, and all $V$ values sum to exactly 1.
The resulting vector $P = [P_1, P_2, ..., P_V]$ represents the model's predicted probability distribution over the entire vocabulary for the current position in the output sequence.
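The formula is straightforward to implement. Below is a small NumPy sketch of the Softmax computation with the usual max-subtraction trick for numerical stability; the three-element logits vector is a toy example, not a real vocabulary.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert a vector of logits into a probability distribution."""
    # Subtracting the maximum logit keeps exp() from overflowing;
    # it does not change the result because Softmax is shift-invariant.
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

logits = np.array([2.0, -1.0, 0.5])   # toy logits for a 3-token vocabulary
probs = softmax(logits)
print(probs)         # approximately [0.786, 0.039, 0.175]
print(probs.sum())   # 1.0, a valid probability distribution
```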
Diagram: flow from the final decoder output vector through the linear layer and Softmax function to produce a probability distribution over the target vocabulary.
During inference (when generating new text), the model typically selects the token with the highest probability from this distribution as the predicted token for the current position. This predicted token is then often fed back into the decoder as input for generating the next token in the sequence. During training, this probability distribution is compared against the actual target token using a loss function (like cross-entropy) to calculate the error and update the model's parameters.
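The following sketch illustrates both uses with the same toy logits as above; the target index is chosen purely for illustration. In practice, frameworks usually compute the cross-entropy loss directly from the logits for numerical stability, but the negative log-probability form shown here matches the description above.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    exp_z = np.exp(z - z.max())        # shift for numerical stability
    return exp_z / exp_z.sum()

logits = np.array([2.0, -1.0, 0.5])   # toy logits for a 3-token vocabulary
probs = softmax(logits)

# Inference: greedy decoding picks the highest-probability token, which is
# then fed back into the decoder as input for the next position.
predicted_id = int(np.argmax(probs))

# Training: cross-entropy loss is the negative log-probability the model
# assigned to the actual target token (index 1 is chosen for illustration).
target_id = 1
loss = -np.log(probs[target_id])

print(predicted_id)            # 0 (the token with the highest score)
print(round(float(loss), 3))   # about 3.24
```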
This combination of a linear layer followed by a Softmax function is the standard way to map the internal representations learned by the Transformer's decoder back to the space of observable tokens, enabling the model to make concrete predictions.