Before the Transformer can process input text, we need to convert the sequence of words (or subwords, resulting from tokenization) into a format suitable for neural networks: vectors of real numbers. Raw text, whether represented as strings or as integer IDs mapped to a vocabulary, carries no semantic meaning in a form networks can readily use, nor is it suited to the mathematical operations involved. This crucial first step is handled by the Input Embedding layer.
Think of the Input Embedding layer as a sophisticated lookup table. First, the input text is tokenized, breaking it down into tokens (which could be words, subwords, or characters depending on the chosen tokenizer) and mapping each token to a unique integer ID based on a predefined vocabulary. We'll look at tokenization strategies like Byte Pair Encoding (BPE) or WordPiece in more detail in Chapter 4, but for now, assume we have a sequence of these integer IDs representing our input sentence.
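As a concrete illustration, a subword tokenizer such as one loaded through the Hugging Face transformers library (used here purely as an example; any tokenizer that maps text to integer IDs will do) turns a sentence into such a list of IDs:

```python
# Illustrative only: a pretrained WordPiece tokenizer mapping text to integer IDs.
# The model name and the resulting IDs are examples, not requirements.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer("Hello world")["input_ids"]
print(token_ids)  # a short list of integer IDs, including special tokens
```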
Each unique ID in the vocabulary is associated with a vector, known as its embedding. The embedding layer maintains a large matrix, often called the embedding matrix or weight matrix. The size of this matrix is V×dmodel, where V is the size of our vocabulary (the total number of unique tokens the model knows) and dmodel is the dimensionality of the embedding vectors (and consequently, the dimensionality used throughout most layers of the Transformer model). In the original "Attention Is All You Need" paper, dmodel was set to 512.
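In PyTorch, this lookup table corresponds to an nn.Embedding module whose weight matrix has shape V × dmodel. The sketch below assumes a vocabulary of 30,000 tokens purely for illustration:

```python
import torch.nn as nn

vocab_size = 30_000   # V: assumed vocabulary size, for illustration only
d_model = 512         # embedding dimension used in the original Transformer paper

# The layer's weight matrix has shape (vocab_size, d_model): one row per token.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)
print(embedding.weight.shape)  # torch.Size([30000, 512])
```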
When a sequence of token IDs enters the embedding layer, the layer essentially performs a lookup. For each token ID in the input sequence, it retrieves the corresponding row vector from the embedding matrix.
For example, if our input sequence is represented by the IDs [101, 7592, 2054, 102] and dmodel is 512, the embedding layer will output a sequence of four vectors, where each vector has 512 dimensions. The first vector is the embedding for token ID 101, the second is the embedding for token ID 7592, and so on.
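Continuing the sketch above, the lookup itself is a single call: indexing the embedding weight matrix with the token IDs. The shapes shown assume the hypothetical sizes chosen earlier.

```python
import torch

# The example IDs from the text; shape (1, 4): a batch of one sequence of four tokens.
token_ids = torch.tensor([[101, 7592, 2054, 102]])

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([1, 4, 512]): one 512-dimensional vector per token
```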
Each token ID in the input sequence is used to look up its corresponding dense vector representation in the embedding matrix.
These embedding vectors are not just arbitrary representations. They are learnable parameters. Initially, they might be random values, but during the model's training process, these vectors are adjusted via backpropagation. The goal is for the learned embeddings to capture semantic similarities between tokens. For instance, after successful training, the embedding vector for "king" might be geometrically close to the vector for "queen" in the high-dimensional embedding space, reflecting their semantic relationship. This dense representation is far more powerful and efficient than sparse representations like one-hot encoding, especially for large vocabularies.
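A common way to probe such relationships is the cosine similarity between two learned vectors. The snippet below is a sketch; the token IDs standing in for "king" and "queen" are hypothetical placeholders, and a freshly initialized layer will show no meaningful structure until it has been trained.

```python
import torch.nn.functional as F

king_id, queen_id = 1000, 1001  # hypothetical IDs; real ones depend on the tokenizer

king_vec = embedding.weight[king_id].unsqueeze(0)    # shape (1, d_model)
queen_vec = embedding.weight[queen_id].unsqueeze(0)  # shape (1, d_model)

# Near zero for randomly initialized weights; tends to be higher for
# semantically related tokens after training.
print(F.cosine_similarity(king_vec, queen_vec).item())
```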
So, the output of the Input Embedding layer is a sequence of vectors, one for each token in the original input sequence, where each vector has the dimension dmodel. This sequence of vectors now holds richer, distributed information about the input tokens.
However, these embeddings only represent the meaning of the tokens themselves. If we simply fed this sequence of vectors directly into the subsequent layers, the model would have no information about the order of the tokens in the sequence. Since the Transformer architecture doesn't use recurrence like RNNs, it doesn't inherently process tokens sequentially. It needs an explicit way to understand token positions. This brings us to the next critical component: Positional Encoding.