As discussed in the introduction, converting raw text into a format suitable for neural networks, specifically sequences of numerical IDs, is a fundamental first step. This process is called tokenization. A straightforward approach might involve splitting text based on whitespace and punctuation and assigning a unique integer ID to each distinct word encountered in the training corpus.
Consider a simple sentence: "LLMs learn representations."
A word-level tokenizer might produce: ["LLMs", "learn", "representations", "."]
If our vocabulary maps "LLMs" to 5, "learn" to 123, "representations" to 456, and "." to 7, the numerical sequence would be [5, 123, 456, 7].
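The mapping above can be reproduced with a few lines of Python; the regular expression and the tiny vocabulary below are purely illustrative:
import re

# Illustrative word-level vocabulary (word -> integer ID), matching the mapping above.
word_to_id = {"LLMs": 5, ".": 7, "learn": 123, "representations": 456}

def word_tokenize(text):
    # Split into words and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("LLMs learn representations.")
ids = [word_to_id[token] for token in tokens]
print(tokens)  # ['LLMs', 'learn', 'representations', '.']
print(ids)     # [5, 123, 456, 7]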
While simple and intuitive, this word-level approach breaks down rapidly when dealing with the massive datasets required to train large language models. Several significant problems arise:
The primary issue is the sheer size of the vocabulary required. Web-scale text corpora contain not just standard dictionary words but also names, places, technical jargon, code snippets, slang, misspellings, and morphological variants (e.g., "run", "runs", "running", "ran"). A word-level vocabulary built from such data can easily swell to contain millions, or even tens of millions, of unique types.
This poses severe practical challenges:
Memory Consumption: The model's input embedding layer maps each vocabulary ID to a dense vector (e.g., 1024 dimensions or more). A vocabulary of 10 million words requires an embedding matrix of 10,000,000 × d_model parameters. Even with a modest d_model = 1024, this single layer alone would require approximately 38 GB of memory just to store the weights in standard 32-bit floating-point format, as the snippet below estimates.
import torch
import torch.nn as nn

# Hypothetical parameters
d_model = 1024  # Model hidden dimension

# Word-level scenario (illustrative)
word_vocab_size = 10_000_000
# Instantiation avoided due to size:
# word_embeddings = nn.Embedding(word_vocab_size, d_model)
word_memory_bytes = word_vocab_size * d_model * 4  # FP32 = 4 bytes per weight
word_memory_gb = word_memory_bytes / (1024**3)
print(
    f"Word Embedding Memory Estimate (Vocab={word_vocab_size:,}, "
    f"d_model={d_model}): {word_memory_gb:.2f} GB"
)

# Subword-level scenario (typical)
subword_vocab_size = 50_000
subword_embeddings = nn.Embedding(subword_vocab_size, d_model)
subword_memory_bytes = subword_vocab_size * d_model * 4  # FP32
subword_memory_gb = subword_memory_bytes / (1024**3)
print(
    f"Subword Embedding Memory Estimate (Vocab={subword_vocab_size:,}, "
    f"d_model={d_model}): {subword_memory_gb:.2f} GB"
)
Computational Cost: The final layer of a typical language model often involves a softmax calculation over the entire vocabulary to predict the next token. The complexity of this operation is proportional to the vocabulary size, O(|V|). A multi-million-word vocabulary makes this final step computationally expensive during both training and inference.
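As a rough sketch of that scaling (counting only the multiply-accumulate operations of the output projection and ignoring the softmax normalization itself), the per-token cost grows linearly with |V|:
d_model = 1024  # Model hidden dimension, as above

# The output projection is a d_model x |V| matrix multiply per position,
# so roughly d_model * |V| multiply-accumulate operations.
for vocab_size in (10_000_000, 50_000):
    mults_per_position = d_model * vocab_size
    print(f"|V| = {vocab_size:>10,}: ~{mults_per_position:,} multiply-adds per predicted token")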
Parameter Inefficiency: Language follows distributions like Zipf's law, where a small number of words occur very frequently, while the vast majority of words are extremely rare (the "long tail"). A word-level model dedicates unique embedding vectors and parameters even to words that appear only once or twice in a massive corpus, which is an inefficient allocation of model capacity.
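The long tail is visible even in a toy example; the sample text below is illustrative only, and the singleton fraction in a real web-scale corpus is far more extreme:
from collections import Counter
import re

# Illustrative sample; in web-scale corpora the fraction of rare words is much larger.
sample = (
    "the cat sat on the mat . the dog sat near the cat . "
    "a quixotic zephyr befuddled the lexicographer ."
)
counts = Counter(re.findall(r"\w+|[^\w\s]", sample))
singletons = [word for word, count in counts.items() if count == 1]

print(f"{len(counts)} word types, {len(singletons)} occurring exactly once")
print("Most frequent:", counts.most_common(3))
# A word-level model would still dedicate a full embedding row to every singleton.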
No matter how large the training corpus, you will inevitably encounter words during inference or evaluation that were not present in the training data. These are Out-of-Vocabulary (OOV) words. Word-level tokenizers must have a strategy for handling these, typically by mapping all unknown words to a single special token, often represented as <UNK> or [UNK].
Consider the sentence: "We analyzed the giga-scale dataset using GloVe embeddings."
If "giga-scale" was not in the training vocabulary, a word tokenizer might produce: ["We", "analyzed", "the", "<UNK>", "dataset", "using", "GloVe", "embeddings", "."]
Replacing words with <UNK> leads to significant information loss. The model has no information about the specific unknown word, hindering its ability to understand or generate nuanced text containing new terms, names, misspellings, or domain-specific vocabulary. The frequency of OOV words increases when the model is applied to domains different from its training data.
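A sketch of this fallback, using a small hypothetical vocabulary in which "giga-scale" is missing:
import re

UNK_ID = 0
# Hypothetical word-level vocabulary; "giga-scale" is not present.
word_to_id = {
    "We": 10, "analyzed": 11, "the": 12, "dataset": 13,
    "using": 14, "GloVe": 15, "embeddings": 16, ".": 17,
}

def encode(text):
    tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
    # Every token missing from the vocabulary collapses onto the same <UNK> ID.
    return tokens, [word_to_id.get(token, UNK_ID) for token in tokens]

tokens, ids = encode("We analyzed the giga-scale dataset using GloVe embeddings.")
print(tokens)  # 'giga-scale' is still visible here...
print(ids)     # ...but its ID is just 0, indistinguishable from any other unknown word.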
Comparison of typical vocabulary sizes (log scale) and Out-Of-Vocabulary (OOV) rates for different tokenization levels on large text corpora. Word-level tokenization yields large vocabularies and higher OOV rates. Character-level has zero OOV but the smallest units. Subword methods offer a balance.
Languages form words through morphology, combining root words with prefixes, suffixes, and inflections. For example, "train", "trainer", "training", "retrain", "trained" all share the root "train". A word-level tokenizer treats each of these as a completely distinct unit with its own ID and embedding vector. This is inefficient, as it fails to leverage the inherent relationship between these related forms. The model must learn the connection between "training" and "trained" from scratch based purely on co-occurrence statistics, rather than recognizing the shared underlying morpheme. This problem is particularly acute in morphologically rich languages (like Turkish, Finnish, or German) but is also significant in English.
To address these challenges, modern LLMs almost universally employ subword tokenization algorithms. The core idea is to break words into smaller, frequently occurring units. These units might be morphemes (like "train", "ing", "er"), common character sequences, or even individual characters for the rarest parts.
For example, the word "tokenization" might be broken down into subword units like ["token", "##ization"], where the "##" prefix marks a piece that continues a word rather than starting one (this notation varies between tokenizers). A rare word like "giga-scale" might become ["giga", "-", "scale"] or perhaps ["g", "##iga", "-", "s", "##cale"], depending on the learned vocabulary.
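The exact segmentation depends on the algorithm and its learned vocabulary, which the following sections cover, but the encoding step often amounts to a greedy longest-match search over the subword vocabulary. The sketch below uses a tiny hand-picked vocabulary and the "##" continuation convention purely for illustration:
def greedy_subword_split(word, vocab):
    # Longest-match-first segmentation, in the spirit of WordPiece inference.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["<UNK>"]  # no known piece covers this span
        pieces.append(piece)
        start = end
    return pieces

# Tiny illustrative vocabulary.
vocab = {"token", "##ization", "train", "##ing", "##ed"}
print(greedy_subword_split("tokenization", vocab))  # ['token', '##ization']
print(greedy_subword_split("training", vocab))      # ['train', '##ing']
print(greedy_subword_split("trained", vocab))       # ['train', '##ed']
Because "training" and "trained" both surface the shared piece "train", the model can reuse whatever it learns about that unit, addressing the morphological redundancy described earlier.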
This approach elegantly mitigates the problems of word-level tokenization:
Controlled vocabulary size: The subword vocabulary is fixed at a chosen size, typically a few tens of thousands of units (the example above used 50,000), keeping the embedding and output layers compact.
No out-of-vocabulary words: Rare or novel words can always be assembled from known subword pieces, falling back to individual characters if necessary, so the <UNK> token becomes largely unnecessary.
Shared morphology: ["train", "##ing"] inherently links to ["train", "##ed"] through the shared "train" subword token.
By operating at the subword level, we achieve a balance: the vocabulary remains manageably sized, OOV issues are virtually eliminated, and the model gains the potential to understand word structure better. The challenge then becomes: how do we determine the optimal set of subword units for a given corpus? Techniques like Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model segmentation, often utilized within frameworks like SentencePiece, provide data-driven methods for constructing these subword vocabularies, as we will explore in the following sections.