Before sequence-order information can be injected, the Transformer needs a way to represent input tokens in a form suitable for neural network processing. Raw text is first converted into a sequence of integers, known as token IDs, through a tokenization process (using methods such as Byte Pair Encoding or WordPiece). These discrete token IDs, however, are not directly suitable as input to the attention mechanism or subsequent neural network layers: integer IDs carry no notion of similarity or relationship between different tokens.

The Input Embedding Layer serves as the initial transformation step, converting these integer token IDs into dense, continuous vector representations. This contrasts sharply with older techniques such as one-hot encoding, which produce very high-dimensional, sparse vectors (mostly zeros, with a single one). Dense embeddings instead capture semantic similarities between tokens in a much lower-dimensional space, whose size is typically referred to as $d_{model}$ (the model's hidden dimension).

### Mechanism: The Embedding Lookup

At its core, the input embedding layer functions as a lookup table. This table is represented by a weight matrix $E$ with dimensions $V \times d_{model}$, where $V$ is the size of the vocabulary (the total number of unique tokens the model recognizes) and $d_{model}$ is the chosen embedding dimension.

When presented with an input sequence of token IDs $[t_1, t_2, ..., t_n]$, the embedding layer retrieves the corresponding vector for each token ID from the matrix $E$. If $t_k$ is the token ID at position $k$, its embedding vector $e_k$ is simply the row of $E$ indexed by $t_k$:

$$ e_k = E_{t_k} $$

This operation maps each integer ID in the input sequence to a dense vector of size $d_{model}$.

```dot
digraph G {
  rankdir=LR;
  node [shape=plaintext, fontsize=10];
  subgraph cluster_0 {
    label = "Input Token IDs";
    bgcolor="#e9ecef"; style=filled;
    "Input" [label="[ 71, 8, 1234, 5 ]\n(Sequence Length = 4)"];
  }
  subgraph cluster_1 {
    label = "Embedding Matrix E (V x d_model)";
    bgcolor="#d0bfff"; style=filled;
    "EmbeddingTable" [label=<
      <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
        <TR><TD BGCOLOR="#dee2e6">ID</TD><TD>Embedding Vector (d_model)</TD></TR>
        <TR><TD>0</TD><TD>[0.1, -0.2, ...]</TD></TR>
        <TR><TD>...</TD><TD>...</TD></TR>
        <TR><TD BGCOLOR="#91a7ff">8</TD><TD BGCOLOR="#91a7ff">[0.5, 0.1, ...]</TD></TR>
        <TR><TD>...</TD><TD>...</TD></TR>
        <TR><TD BGCOLOR="#91a7ff">71</TD><TD BGCOLOR="#91a7ff">[-0.3, 0.9, ...]</TD></TR>
        <TR><TD>...</TD><TD>...</TD></TR>
        <TR><TD BGCOLOR="#91a7ff">1234</TD><TD BGCOLOR="#91a7ff">[0.8, -0.1, ...]</TD></TR>
        <TR><TD>...</TD><TD>...</TD></TR>
        <TR><TD BGCOLOR="#91a7ff">5</TD><TD BGCOLOR="#91a7ff">[0.0, 0.4, ...]</TD></TR>
        <TR><TD>...</TD><TD>...</TD></TR>
        <TR><TD>V-1</TD><TD>[...]</TD></TR>
      </TABLE>
    >];
  }
  subgraph cluster_2 {
    label = "Output Embeddings";
    bgcolor="#96f2d7"; style=filled;
    "Output" [label=<
      <TABLE BORDER="0" CELLBORDER="0" CELLSPACING="2">
        <TR><TD BGCOLOR="#a5d8ff">[-0.3, 0.9, ...]</TD></TR>
        <TR><TD BGCOLOR="#a5d8ff">[0.5, 0.1, ...]</TD></TR>
        <TR><TD BGCOLOR="#a5d8ff">[0.8, -0.1, ...]</TD></TR>
        <TR><TD BGCOLOR="#a5d8ff">[0.0, 0.4, ...]</TD></TR>
      </TABLE>
    >];
    "Dim" [label="(Sequence Length x d_model)"];
  }
  "Input" -> "EmbeddingTable" [label="Lookup", style=dashed, arrowhead=open, constraint=false];
  "EmbeddingTable" -> "Output";
}
```

*Mapping discrete token IDs to continuous vector representations via embedding lookup.*
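The lookup can be expressed in a few lines with a standard deep-learning framework. The sketch below uses PyTorch's `nn.Embedding`; the vocabulary size, `d_model`, and the token IDs (taken from the diagram above) are illustrative placeholders, not values from any particular model.

```python
import torch
import torch.nn as nn

V = 10_000      # vocabulary size (illustrative)
d_model = 512   # embedding dimension (illustrative)

# The embedding layer holds the V x d_model matrix E as a learnable parameter.
embedding = nn.Embedding(num_embeddings=V, embedding_dim=d_model)

# A batch with one sequence of four token IDs, matching the diagram above.
token_ids = torch.tensor([[71, 8, 1234, 5]])   # shape: (batch=1, seq_len=4)

# Lookup: each ID indexes a row of E, i.e. e_k = E[t_k].
embeddings = embedding(token_ids)              # shape: (1, 4, d_model)
print(embeddings.shape)                        # torch.Size([1, 4, 512])

# Equivalent to indexing the weight matrix directly.
assert torch.equal(embeddings, embedding.weight[token_ids])
```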
### Learnable Representations

Significantly, the embedding matrix $E$ is not fixed; its values are learnable parameters. During training, gradients are backpropagated through the network and the embedding vectors are adjusted to minimize the overall loss. As a result, the model learns to place tokens with similar semantic meanings or functions close together in the $d_{model}$-dimensional embedding space. For example, the vectors for "king" and "queen" may end up close to each other after training, reflecting their semantic relationship.

### Output and Dimensionality

The output of the input embedding layer is a sequence of vectors $[e_1, e_2, ..., e_n]$, where each $e_k$ has size $d_{model}$. This sequence represents the input tokens in a continuous space while retaining the original sequence length. The dimensionality $d_{model}$ (e.g., 512, 768, or 1024) is a fundamental hyperparameter of the Transformer architecture, determining the width of the vector representations throughout most of the model's layers.

This sequence of token embeddings is the foundation on which positional information will be added, as discussed next. While the embeddings now carry semantic meaning learned from data, they contain no inherent representation of a token's position within the sequence, a limitation addressed by positional encodings.
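Both points, that $E$ is trained like any other weight and that the lookup itself is position-agnostic, can be checked directly. This is a minimal sketch under the same illustrative sizes as before; the dummy loss exists only to demonstrate gradient flow.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=512)  # illustrative sizes

# Property 1: E is learnable -- a backward pass produces gradients for its rows.
out = embedding(torch.tensor([[71, 8, 1234, 5]]))
out.sum().backward()                                  # dummy loss, just to trigger backprop
print(embedding.weight.grad.abs().sum() > 0)          # tensor(True): the lookup table is trained

# Property 2: the lookup carries no positional information.
repeated = torch.tensor([[8, 71, 8]])                 # token 8 appears at positions 0 and 2
vectors = embedding(repeated)
print(torch.equal(vectors[0, 0], vectors[0, 2]))      # True: identical vectors at both positions
```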