Before we can inject information about sequence order, the Transformer needs a way to represent the input tokens themselves in a format suitable for processing by neural networks. Raw text is first converted into a sequence of integers, known as token IDs, through a tokenization process (using methods like Byte Pair Encoding or WordPiece, which are typically covered in foundational NLP courses). These discrete token IDs, however, are not directly suitable for input into the attention mechanism or subsequent neural network layers. They lack a notion of similarity or relationship between different tokens.
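To make this concrete, here is a minimal sketch of the text-to-IDs step using a made-up toy vocabulary; real models use learned subword vocabularies (BPE, WordPiece) rather than a hand-written word map, so the IDs below are purely illustrative.

```python
# A toy word-level "tokenizer": a hypothetical vocabulary mapping strings to IDs.
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

text = "the cat sat on the mat"
token_ids = [toy_vocab[word] for word in text.split()]
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```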
The Input Embedding Layer serves as the initial transformation step, converting these integer token IDs into dense, continuous vector representations. This contrasts sharply with older techniques like one-hot encoding, which produce very high-dimensional, sparse vectors (mostly zeros, with a single one). Dense embeddings instead capture semantic similarities between tokens in a much lower-dimensional space, whose dimension is typically denoted $d_{model}$ (the model's hidden dimension).
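The size difference is easy to see in code. The sketch below contrasts a one-hot vector with a dense embedding for the same token, using PyTorch's `nn.Embedding` and assumed values for the vocabulary size and embedding dimension.

```python
import torch

# Illustrative sizes (assumed for this sketch).
V = 50_000      # vocabulary size
d_model = 512   # embedding dimension

token_id = 7

# One-hot: a sparse V-dimensional vector with a single 1.
one_hot = torch.zeros(V)
one_hot[token_id] = 1.0
print(one_hot.shape)   # torch.Size([50000]) -- 49,999 entries are zero

# Dense embedding: a learned d_model-dimensional vector.
embedding = torch.nn.Embedding(V, d_model)
dense = embedding(torch.tensor(token_id))
print(dense.shape)     # torch.Size([512]) -- every entry carries information
```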
At its core, the input embedding layer functions as a lookup table. This table is represented by a weight matrix, let's call it $E$, with dimensions $V \times d_{model}$, where $V$ is the size of the vocabulary (the total number of unique tokens the model recognizes) and $d_{model}$ is the chosen embedding dimension.
When presented with an input sequence of token IDs, $[t_1, t_2, \dots, t_n]$, the embedding layer retrieves the corresponding vector for each token ID from the matrix $E$. If $t_k$ is the token ID at position $k$, its embedding vector $e_k$ is simply the row of $E$ indexed by $t_k$:

$$e_k = E_{t_k}$$
This operation effectively maps each integer ID in the input sequence to a dense vector of size $d_{model}$.
Mapping discrete token IDs to continuous vector representations via embedding lookup.
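In frameworks such as PyTorch, this lookup is exactly a row-indexing operation on the weight matrix. The sketch below uses assumed toy sizes and hypothetical token IDs to show that `nn.Embedding` returns the same vectors as indexing its weight matrix directly.

```python
import torch

V, d_model = 1_000, 64                        # assumed toy sizes
embedding = torch.nn.Embedding(V, d_model)    # weight matrix E of shape (V, d_model)

token_ids = torch.tensor([12, 7, 99, 7])      # hypothetical input sequence [t1, ..., tn]

# The layer output is just a row lookup: e_k = E[t_k]
e = embedding(token_ids)                      # shape (4, 64)
same = embedding.weight[token_ids]            # direct row indexing of E

print(torch.equal(e, same))                   # True
```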
Importantly, the embedding matrix $E$ is not fixed; its values are learnable parameters. During training, gradients are backpropagated through the network, and the embedding vectors are adjusted to minimize the overall loss function. This means the model learns to place tokens with similar semantic meanings or functions closer together in the $d_{model}$-dimensional embedding space. For example, words like "king" and "queen" might end up with similar embedding vectors after training, reflecting their semantic relationship.
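A quick way to see that the embedding matrix is trainable is to backpropagate a toy loss through a lookup: only the rows corresponding to the looked-up token IDs receive a gradient. The sizes, IDs, and loss below are assumed purely for illustration.

```python
import torch

embedding = torch.nn.Embedding(1_000, 64)    # assumed toy sizes
token_ids = torch.tensor([3, 42])

# A toy loss on the looked-up vectors; a real training loss flows gradients the same way.
loss = embedding(token_ids).sum()
loss.backward()

grad = embedding.weight.grad
print(grad[3].abs().sum() > 0, grad[42].abs().sum() > 0)  # both True: the used rows got gradients
print(grad[0].abs().sum())                                # 0: this row was never looked up
```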
The output of the input embedding layer is a sequence of vectors, $[e_1, e_2, \dots, e_n]$, where each $e_k$ is a vector of size $d_{model}$. This sequence now represents the input tokens in a continuous space but retains the original sequence length. The dimensionality $d_{model}$ (e.g., 512, 768, or 1024) is a fundamental hyperparameter of the Transformer architecture, determining the width of the vector representations throughout most of the model's layers.
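In practice the layer is applied to a whole batch of sequences at once, producing a tensor of shape (batch, sequence length, $d_{model}$) that downstream layers consume. The sketch below uses assumed sizes to show that shape.

```python
import torch

V, d_model = 30_000, 512                   # assumed sizes; d_model sets the model width
embedding = torch.nn.Embedding(V, d_model)

batch = torch.randint(0, V, (8, 128))      # 8 sequences of 128 token IDs each
x = embedding(batch)

print(x.shape)                             # torch.Size([8, 128, 512]): (batch, seq_len, d_model)
```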
This sequence of token embeddings is the foundation on which positional information is added. While these embeddings now carry semantic meaning learned from data, they still lack any inherent representation of each token's position within the original sequence, a limitation addressed by the positional encodings discussed next.