Selecting the vocabulary size, often denoted as ∣V∣, for your subword tokenizer is a significant hyperparameter choice with direct consequences for model performance, memory consumption, and computational cost. Unlike simpler tokenization methods where vocabulary might grow organically, subword algorithms like BPE or WordPiece require you to pre-define a target vocabulary size. This decision involves balancing several competing factors.
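Before weighing those factors, it helps to see where the choice is actually made. The snippet below is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; the two-sentence corpus and the choice of vocab_size=32000 are placeholders for illustration, not a recommendation.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
# Placeholder corpus; a real tokenizer is trained on a large text collection.
corpus = [
    "Subword tokenizers split rare words into smaller pieces.",
    "The target vocabulary size is fixed before training begins.",
]
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# The target vocabulary size |V| is a trainer argument, chosen up front.
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
# On a tiny corpus, merging stops early, so the actual vocabulary is far smaller.
print(tokenizer.get_vocab_size())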
A smaller vocabulary size forces the tokenizer to break words down into smaller subword units more frequently.
Conversely, a larger vocabulary size allows the tokenizer to represent more common words or frequent subword sequences as single tokens.
Consider the word "tokenization". With a small vocabulary, it might be split as ['tok', 'en', 'ization'] (3 tokens); with an intermediate vocabulary as ['token', 'ization'] (2 tokens); and with a large vocabulary it might remain a single token, ['tokenization'] (1 token). Longer sequences directly impact the memory needed for activations during training, especially the attention score matrix (L×L).
Larger vocabularies generally lead to shorter average sequence lengths for a fixed corpus, reducing attention computation costs.
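To put rough numbers on this, the sketch below assumes the same document comes out to 1,200 tokens with a smaller vocabulary and 900 tokens with a larger one, and estimates the per-layer attention score matrix each implies. The token counts, head count, and float16 activations are illustrative assumptions.
# Illustrative sequence lengths for the same document under two tokenizers
# (assumed values, not measurements).
seq_len_small_vocab = 1200
seq_len_large_vocab = 900
n_heads = 32          # example head count
bytes_per_value = 2   # float16 activations
def attention_score_memory_mb(seq_len: int) -> float:
    """Memory of one layer's (n_heads, L, L) attention score tensor."""
    return n_heads * seq_len * seq_len * bytes_per_value / (1024**2)
print(f"Smaller vocab, L={seq_len_small_vocab}: "
      f"{attention_score_memory_mb(seq_len_small_vocab):.1f} MB per layer")
print(f"Larger vocab,  L={seq_len_large_vocab}: "
      f"{attention_score_memory_mb(seq_len_large_vocab):.1f} MB per layer")
# --- Output ---
# Smaller vocab, L=1200: 87.9 MB per layer
# Larger vocab,  L=900: 49.4 MB per layer
Because the score matrix grows with L², a 25% reduction in sequence length cuts this activation memory by roughly 44%.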
The vocabulary size directly determines the size of the input embedding matrix and the output projection layer (often tied). The embedding matrix has dimensions ∣V∣ × d_model, where d_model is the hidden dimension of the model.
Embedding Matrix Size = ∣V∣ × d_model × bytes_per_parameter

A larger ∣V∣ leads to a proportionally larger embedding matrix. For large models with d_model in the thousands, increasing ∣V∣ from 30,000 to 100,000 adds hundreds of millions of parameters and gigabytes of memory to the model, solely in the embedding layer.
import torch
import torch.nn as nn
# Example embedding layer sizes
d_model = 4096 # Example hidden dimension
vocab_size_small = 32000
vocab_size_large = 128000
embedding_small = nn.Embedding(vocab_size_small, d_model)
embedding_large = nn.Embedding(vocab_size_large, d_model)
params_small = sum(p.numel() for p in embedding_small.parameters())
params_large = sum(p.numel() for p in embedding_large.parameters())
# Memory assuming float32 (4 bytes per parameter)
mem_small_gb = params_small * 4 / (1024**3)
mem_large_gb = params_large * 4 / (1024**3)
print(
f"Vocab Size: {vocab_size_small}, "
f"Embedding Params: {params_small:,}, "
f"Memory: {mem_small_gb:.2f} GB"
)
print(
f"Vocab Size: {vocab_size_large}, "
f"Embedding Params: {params_large:,}, "
f"Memory: {mem_large_gb:.2f} GB"
)
# --- Output ---
# Vocab Size: 32000, Embedding Params: 131,072,000, Memory: 0.49 GB
# Vocab Size: 128000, Embedding Params: 524,288,000, Memory: 1.95 GB
This calculation only covers the embedding layer. The final output layer, which projects the hidden state back to the vocabulary dimension for predicting the next token, has the same shape (d_model × ∣V∣); unless its weights are tied to the input embedding, it further amplifies the impact of ∣V∣ on model size.
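The following sketch shows the standard weight-tying idiom in PyTorch, reusing the illustrative dimensions from above; when the output projection shares its weight with the input embedding, the vocabulary-dependent parameters are paid for only once.
import torch.nn as nn
d_model = 4096
vocab_size = 128000
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Tie the output projection to the input embedding; both now share one
# (vocab_size, d_model) weight tensor.
lm_head.weight = embedding.weight
# Count unique parameter tensors so the shared weight is not counted twice.
unique_params = {p.data_ptr(): p.numel()
                 for p in list(embedding.parameters()) + list(lm_head.parameters())}
print(f"Unique parameters with tying: {sum(unique_params.values()):,}")
# --- Output ---
# Unique parameters with tying: 524,288,000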
During training and inference, the model typically calculates probabilities over the entire vocabulary using a softmax function applied to the output logits. The computational cost of this final softmax layer scales linearly with ∣V∣.
Softmax Cost ∝ L × d_model × ∣V∣

While the attention mechanism (O(L² × d_model) per layer) often dominates for long sequences, the output projection and softmax (O(L × d_model × ∣V∣)) can become a significant bottleneck, especially with very large vocabularies or during inference, where batch sizes may be small and latency matters. A larger ∣V∣ directly increases this cost.
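A back-of-the-envelope comparison makes this concrete. The sketch below counts multiply-accumulates for the attention score computation across all layers versus the single output projection, using illustrative values for d_model, layer count, sequence length, and vocabulary size; it ignores the feed-forward blocks and attention value aggregation.
# Rough multiply-accumulate counts for one forward pass over a full sequence.
# All values are illustrative assumptions.
d_model = 4096
n_layers = 32
seq_len = 4096  # L
def attention_score_cost() -> int:
    # O(L^2 * d_model) per layer, summed over all layers.
    return n_layers * seq_len**2 * d_model
def output_projection_cost(vocab_size: int) -> int:
    # Final logits over the vocabulary: O(L * d_model * |V|), computed once.
    return seq_len * d_model * vocab_size
print(f"Attention scores (all layers):   {attention_score_cost():,}")
for vocab_size in (32000, 128000):
    print(f"Output projection, |V|={vocab_size}: {output_projection_cost(vocab_size):,}")
# --- Output ---
# Attention scores (all layers):   2,199,023,255,552
# Output projection, |V|=32000: 536,870,912,000
# Output projection, |V|=128000: 2,147,483,648,000
With these assumed numbers, quadrupling ∣V∣ quadruples the output-projection cost, and at ∣V∣ = 128,000 that single projection costs roughly as much as all of the attention score computations combined.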
There's no single "best" vocabulary size; it depends on the specific task, language(s), dataset size, model architecture, and available computational resources. However, the trade-offs follow a consistent pattern: a smaller ∣V∣ keeps the embedding matrix, output projection, and softmax cheap, but splits words into more pieces and produces longer sequences; a larger ∣V∣ represents common words as single tokens and shortens sequences, but enlarges the embedding and output layers and makes the softmax more expensive.
In practice, selecting ∣V∣ often involves empirical testing or adopting sizes reported in the literature for similar model scales and datasets. You need to weigh the benefits of a larger ∣V∣ (shorter sequences and potentially better representation of common terms) against its costs (increased memory usage and slower softmax computation), while keeping in mind that an overly small ∣V∣ fragments rare terms into many small subword units.