After converting our text into sequences of integers, we face a potential issue. Representing words like "cat" as 5, "dog" as 12, and "house" as 87 provides a numerical format, but these numbers lack inherent meaning or relationship. The integer 12 is not intrinsically "closer" to 5 than it is to 87 in terms of meaning, even though "cat" and "dog" are semantically related animals. Furthermore, feeding large, arbitrary integers directly into a neural network isn't typically effective.
An alternative, one-hot encoding (representing "cat" as [0,0,0,0,1,0,...] in a vector the size of the vocabulary), quickly becomes computationally infeasible and data-sparse for realistically sized vocabularies (tens of thousands of words). We need a more efficient and meaningful way to represent words numerically.
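To make the size difference concrete, here is a small sketch; the 50,000-word vocabulary, the 128-dimensional embedding, and the token index are illustrative assumptions, not values from any particular dataset:

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size = 50_000
embedding_dim = 128

# One-hot encoding: a vocab_size-long vector with a single 1
# and 49,999 zeros for each token.
cat_index = 5
one_hot_cat = np.zeros(vocab_size, dtype=np.float32)
one_hot_cat[cat_index] = 1.0

# Dense embedding: a short vector of floats for the same token.
dense_cat = np.random.rand(embedding_dim).astype(np.float32)

print(one_hot_cat.shape)  # (50000,) -> almost entirely zeros
print(dense_cat.shape)    # (128,)   -> compact and information-dense
```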
This is where embedding layers come in. The core idea is to represent each distinct token (word, character, etc.) in our vocabulary not as a single integer or a sparse vector, but as a dense, relatively low-dimensional vector of floating-point numbers. These are called word embeddings or token embeddings.
Think of an embedding layer as a trainable lookup table. Internally, it stores a weight matrix of shape (vocabulary_size, embedding_dimension). Each row i in this matrix corresponds to the initial embedding vector for the token with integer index i. When the layer receives a batch of integer-encoded sequences of shape (batch_size, sequence_length), it looks up the row for each integer, so the output will be a tensor of shape (batch_size, sequence_length, embedding_dimension). In other words, the integer indices are used to look up the corresponding dense vector representations within the Embedding layer's internal weight matrix.
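As a minimal sketch of this mechanism, the NumPy snippet below (with made-up sizes and token indices) shows that the lookup is nothing more than row indexing into the weight matrix:

```python
import numpy as np

# Hypothetical dimensions for illustration.
vocab_size = 10_000
embedding_dim = 8

# The embedding layer's internal weight matrix: one row per token.
embedding_matrix = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

# A batch of 2 integer-encoded sequences, each 4 tokens long:
# shape (batch_size, sequence_length).
token_ids = np.array([[5, 12, 87, 3],
                      [9,  0,  0, 0]])

# The "lookup" is just row indexing: each integer selects its row.
embedded = embedding_matrix[token_ids]

print(embedded.shape)  # (2, 4, 8) -> (batch_size, sequence_length, embedding_dim)
```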
The true power of embedding layers lies in the fact that the embedding vectors (the rows in the lookup table) are trainable parameters. They are typically initialized randomly and then adjusted during the network's training process via backpropagation, just like any other weights.
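A short PyTorch sketch, using illustrative sizes and token indices, shows that the lookup table behaves like any other weight matrix: it requires gradients, and only the rows that were actually looked up receive a nonzero gradient after a backward pass:

```python
import torch
import torch.nn as nn

# A small embedding layer; sizes are illustrative.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)

# The lookup table is an ordinary trainable parameter.
print(embedding.weight.requires_grad)  # True

# A toy forward/backward pass with a dummy scalar loss.
token_ids = torch.tensor([[5, 12, 87]])
loss = embedding(token_ids).sum()
loss.backward()

print(embedding.weight.grad[5])   # nonzero gradient: token 5 was looked up
print(embedding.weight.grad[1])   # all zeros: token 1 was not used
```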
As the network learns to perform its task (e.g., sentiment classification, language modeling), it implicitly learns to arrange the embedding vectors in the embedding_dimension-dimensional space such that tokens with similar meanings or usage end up close together, and some relationships between tokens are reflected in the geometry of the space. A classic illustration is that vector("king") - vector("man") + vector("woman") often results in a vector very close to vector("queen").
This learned, dense representation provides a much richer and more useful input signal to the subsequent layers of the network, such as RNNs, LSTMs, or GRUs, compared to raw integer indices or one-hot vectors. The output shape (batch_size, sequence_length, embedding_dimension) is exactly what recurrent layers expect as input, where the last dimension represents the features for each element in the sequence at each time step.
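The following PyTorch sketch (all sizes are illustrative) verifies that the embedding output can be fed straight into a recurrent layer with no reshaping:

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
vocab_size, embedding_dim, hidden_size = 10_000, 64, 32
batch_size, sequence_length = 8, 20

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)

# Integer-encoded input: (batch_size, sequence_length).
token_ids = torch.randint(0, vocab_size, (batch_size, sequence_length))

embedded = embedding(token_ids)        # (8, 20, 64)
outputs, (h_n, c_n) = lstm(embedded)   # outputs: (8, 20, 32)

print(embedded.shape, outputs.shape)
```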
In practice, you rarely implement the lookup mechanism yourself. Deep learning frameworks like TensorFlow and PyTorch provide built-in Embedding
layers that handle the initialization, lookup, and training efficiently. You simply need to add this layer as the first step in your model after the input layer that receives the integer-encoded sequences.
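For instance, a minimal Keras model might look like the sketch below; the vocabulary size, embedding dimension, and binary classification head are illustrative choices, not fixed requirements:

```python
import tensorflow as tf

# Hypothetical hyperparameters for illustration.
vocab_size = 10_000
embedding_dim = 64

model = tf.keras.Sequential([
    # Variable-length integer-encoded sequences come in first.
    tf.keras.Input(shape=(None,), dtype="int32"),
    # The Embedding layer converts them into dense vectors of size embedding_dim.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # The recurrent layer consumes (batch, sequence, embedding_dim).
    tf.keras.layers.LSTM(32),
    # A toy classification head, e.g. for sentiment analysis.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```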