Computers don't understand words and sentences the way humans do. They operate on numbers. So, the first hurdle in getting a machine to process language is converting text into a numerical format that the model can actually work with. This process involves two main steps: breaking the text down into manageable pieces called tokens, and then representing these tokens as numerical lists called embeddings.
Imagine you have the sentence: "LLMs learn language patterns."
Instead of looking at the whole sentence at once, or even word by word, LLMs often break text down into smaller units called tokens. A token might be a whole word, a part of a word (a subword), or even just punctuation.
The process of breaking text into tokens is called tokenization. How a specific piece of text is tokenized depends on the tokenizer used, which is typically chosen and trained alongside the LLM itself.
Consider these examples:
"Large Language Models"
might become ["Large", " Language", " Models"]
(three tokens)"tokenization"
might become ["token", "ization"]
(two tokens, capturing the root and suffix)"isn't"
might become ["is", "n't"]
(two tokens)" U.S.A."
might become [" U", ".", "S", ".", "A", "."]
(six tokens, including spaces and punctuation)Here's a simple visualization of how a sentence might be tokenized:
["LLMs", " learn", " language", " patterns", "."]
Notice that spaces can sometimes be part of a token (like " learn").
Why use subwords? Tokenizing into subwords helps the model handle unfamiliar words or variations. If the model knows "token" and "ization", it might be able to understand "tokenization" even if it hasn't seen that specific word frequently during training. It also keeps the total number of unique tokens (the vocabulary) manageable.
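To see this in practice, here is a minimal sketch using the Hugging Face transformers package and its GPT-2 tokenizer. This is just one of many possible tokenizers; the exact splits you get depend entirely on which tokenizer you load.

```python
# A minimal sketch of subword tokenization, assuming the Hugging Face
# "transformers" package is installed. The GPT-2 tokenizer is one example;
# other tokenizers will split the same text differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Large Language Models", "tokenization", "isn't"]:
    # tokenize() returns the token strings; GPT-2 marks a leading space with 'Ġ'
    print(text, "->", tokenizer.tokenize(text))
```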
Once tokenized, each unique token in the model's vocabulary is assigned a specific integer ID, for example:
"LLMs" -> 5842
" learn" -> 3410
" patterns" -> 11394
"." -> 13
So, our sentence numerically becomes a sequence of five IDs, one per token: [5842, 3410, <ID for " language">, 11394, 13].
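If you want to see real IDs, the sketch below (again assuming the transformers package and the GPT-2 tokenizer) encodes the example sentence. GPT-2's actual IDs will differ from the illustrative numbers used above.

```python
# A minimal sketch of converting text to token IDs and back, assuming the
# Hugging Face "transformers" package. GPT-2's actual IDs will differ from
# the illustrative IDs used in the text above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("LLMs learn language patterns.")
print(ids)                                    # one integer per token
print(tokenizer.convert_ids_to_tokens(ids))   # the token strings behind those IDs
print(tokenizer.decode(ids))                  # round-trips back to the text
```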
These integer IDs tell the model which token it is seeing, but they don't inherently capture meaning or the relationships between tokens. The ID 3410 (" learn") isn't related to the ID for " language" in any meaningful way just because the two tokens often appear together.
This is where embeddings come in. An embedding is a dense list of numbers, also known as a vector, that represents a token. Instead of a single ID, each token is mapped to a vector of maybe hundreds or thousands of dimensions (numbers).
token_ID: 5842 ("LLMs")
-> embedding_vector: [0.12, -0.45, 0.67, ..., -0.09]
(e.g., 768 numbers long)token_ID: 3410 (" learn")
-> embedding_vector: [-0.23, 0.01, 0.88, ..., 0.51]
token_ID: 1 linguagem (" language")
-> embedding_vector: [-0.10, 0.15, 0.91, ..., 0.44]
These embedding vectors are not manually assigned; they are learned by the model during its extensive training process. The model adjusts the values in these vectors so that tokens used in similar contexts end up having similar embedding vectors.
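Inside the model, this mapping from ID to vector is typically stored as a large lookup table. Here is a minimal sketch using PyTorch's nn.Embedding; the vocabulary size and dimension below are illustrative, and a freshly created table starts out with random values rather than trained ones.

```python
# A minimal sketch of an embedding lookup table using PyTorch. The sizes are
# illustrative, and this freshly initialized table holds random values; in a
# real LLM the values are adjusted (learned) during training.
import torch
import torch.nn as nn

vocab_size = 50257     # e.g., the size of the GPT-2 vocabulary
embedding_dim = 768    # e.g., 768 numbers per token, as in the example above

embedding_table = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([5842, 3410, 11394, 13])  # illustrative IDs from above
vectors = embedding_table(token_ids)               # look up one vector per ID
print(vectors.shape)                               # torch.Size([4, 768])
```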
Think of it like assigning coordinates to each token in a high-dimensional space. Tokens with similar meanings or that are used in similar ways ("dog" and "puppy", or "run" and "ran") will have vectors that are "close" to each other in this space. Conversely, unrelated tokens ("car" and "banana") will have vectors that are far apart.
While we can't easily visualize hundreds or thousands of dimensions, imagine a simplified 2D space: words with related meanings (like 'King' and 'Queen', or 'Man' and 'Woman') would sit closer to each other than unrelated words. Real embeddings exist in much higher dimensions.
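One common way to measure this "closeness" is cosine similarity between vectors. The sketch below uses NumPy and three tiny made-up vectors purely for illustration; real embeddings come from a trained model and have far more dimensions.

```python
# A minimal sketch of cosine similarity between embedding vectors, using NumPy.
# The three tiny vectors below are invented for illustration only; real
# embeddings have hundreds or thousands of dimensions.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog    = np.array([0.80, 0.65, 0.10])
puppy  = np.array([0.75, 0.70, 0.15])
banana = np.array([-0.20, 0.10, 0.90])

print(cosine_similarity(dog, puppy))   # close to 1: the vectors point the same way
print(cosine_similarity(dog, banana))  # near 0: the vectors are nearly unrelated
```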
This vector representation is much richer than a simple ID. It captures semantic relationships, grammatical roles, and contextual nuances learned from the vast amounts of text data the LLM was trained on.
In summary, this transformation from text to meaningful numerical vectors (embeddings) is the essential first step enabling LLMs to "understand" and manipulate language. The next sections will look at how the model uses these embeddings to perform tasks like predicting the next word.