Text models process arrays of numbers rather than characters. Training a Small Language Model requires converting human-readable instruction datasets into sequences of integers. This translation is handled by a tokenizer. Most modern language models use subword tokenization algorithms such as Byte-Pair Encoding or SentencePiece. Instead of mapping entire words to single integers or mapping individual characters, subword algorithms strike a balance between the two. They split rare words into smaller, more frequent chunks and assign a specific integer ID to each chunk.
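The idea can be sketched with a toy greedy longest-match splitter over a hypothetical subword vocabulary (this is an illustration of the concept, not a real BPE implementation; the vocabulary and IDs below are made up):

```python
# Hypothetical subword vocabulary mapping chunks to integer IDs.
vocab = {"token": 1, "ization": 2, "un": 3, "break": 4, "able": 5}

def subword_split(word, vocab):
    """Greedily split a word into the longest known subword chunks."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no subword covers position {i} in {word!r}")
    return pieces

print(subword_split("tokenization", vocab))                 # ['token', 'ization']
print([vocab[p] for p in subword_split("unbreakable", vocab)])  # [3, 4, 5]
```

A rare word like "unbreakable" never needs its own ID; it is encoded as three frequent chunks, which keeps the vocabulary small while still covering unseen words.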
When you format a dataset for training, you do not process one sequence at a time. Data is processed in batches to optimize GPU matrix operations. A neural network requires a perfectly rectangular tensor of shape (batch_size, sequence_length) to perform these operations. Natural language sequences, however, naturally vary in length.
To achieve uniform matrix dimensions across a batch, you apply padding. If you have a batch with sequences of varying lengths, you identify the maximum length in that specific batch. Let this maximum length be L_max. For any sequence in the batch with a length L where L < L_max, you append a dedicated padding token ID to the sequence. The number of padding tokens added to each sequence is calculated as follows:

padding_count = L_max - L

By appending padding tokens, every sequence in the batch reaches the exact length of L_max, resulting in a valid rectangular tensor.
Process of converting raw text sequences of varying lengths into a uniform padded tensor matrix for batch processing.
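A minimal sketch of this padding step in plain Python, with made-up token IDs and an assumed padding ID of 0:

```python
# Pad every sequence in the batch to the length of the longest one.
PAD_ID = 0  # assumed padding token ID for this illustration

batch = [[12, 7, 99], [5], [8, 3, 41, 17, 2]]
max_len = max(len(seq) for seq in batch)  # L_max = 5 for this batch

padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)  # every row now has length 5
```

The result is a 3 x 5 rectangle of integers, exactly the shape a (batch_size, sequence_length) tensor requires.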
While padding ensures shorter sequences meet the required matrix dimensions, you must also manage sequences that exceed the model's maximum context window. If a sequence is longer than the absolute maximum length the architecture can process, it will cause an out-of-memory error or a tensor shape mismatch.
Truncation solves this by cutting off the sequence at the maximum allowed length. In supervised fine-tuning, you typically truncate the end of the sequence. However, you must be careful with instruction datasets. If you truncate the response portion of an instruction-response pair, the model will learn to generate incomplete answers. A common strategy is to filter out excessively long sequences during the initial data preparation phase to avoid aggressive truncation during tokenization.
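The filtering strategy can be sketched as a simple pre-tokenization pass. The token counts below are stand-in values; in practice you would measure length with the tokenizer itself:

```python
# Drop examples whose tokenized length exceeds the context window,
# rather than truncating mid-response during training.
MAX_LENGTH = 256

examples = [
    {"text": "Translate this sentence.", "length": 12},      # stand-in length
    {"text": "A very long transcript ...", "length": 900},   # too long
    {"text": "Summarize the paragraph.", "length": 240},
]

kept = [ex for ex in examples if ex["length"] <= MAX_LENGTH]
print(len(kept))  # 2: the 900-token example is filtered out
```

Filtering up front means truncation only ever fires on borderline cases, so the model rarely, if ever, sees an instruction whose answer was cut off.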
Using the Hugging Face transformers library, you can handle tokenization, padding, and truncation in a single function call. Many Small Language Models do not define a dedicated padding token by default. In these scenarios, it is standard practice to assign the End of Sequence (EOS) token as the padding token.
from transformers import AutoTokenizer

# Initialize the tokenizer from a pre-trained SLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")

# Assign the EOS token as the padding token if one is not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Sample instruction dataset entries
texts = [
    "Translate the following sentence to French.",
    "Say hi.",
    "Write a python script to calculate the fibonacci sequence."
]

# Apply tokenization, padding, and truncation
encoded_inputs = tokenizer(
    texts,
    padding="longest",
    truncation=True,
    max_length=256,
    return_tensors="pt"
)

print(encoded_inputs["input_ids"])
In the implementation above, the padding="longest" argument instructs the tokenizer to pad the shorter sequences to match the length of the longest sequence in the current batch. This is highly memory efficient compared to padding="max_length", which would force every sequence to length 256 regardless of the actual text lengths in the batch. The return_tensors="pt" argument ensures the output is formatted as a PyTorch tensor, ready to be moved to a GPU.
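The memory difference is easy to see with a pure-Python stand-in for the tokenizer's two strategies (token IDs and lengths here are made up for illustration):

```python
# Compare padding="longest" vs padding="max_length" on token-ID lists.
MAX_LENGTH = 256
PAD_ID = 0

batch = [[1] * 9, [1] * 3, [1] * 14]  # stand-in tokenized sequences

# "longest": pad only up to the longest sequence in this batch (14).
longest = max(len(seq) for seq in batch)
padded_longest = [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]

# "max_length": pad everything up to the fixed limit (256).
padded_max = [seq + [PAD_ID] * (MAX_LENGTH - len(seq)) for seq in batch]

print(len(padded_longest[0]), len(padded_max[0]))  # 14 vs 256
```

For this batch, "longest" produces a 3 x 14 tensor instead of 3 x 256, so over 94% of the padding work (and the memory it occupies) is avoided.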
The direction in which you apply padding matters depending on the task. For training autoregressive causal language models, you apply right-padding. This means the padding tokens are appended to the end of the sequence. During the forward pass, the model predicts the next token from left to right, making right-padding the logical choice for maintaining sequence integrity.
Left-padding, where tokens are added to the beginning of the sequence, is occasionally used during batched inference. It ensures the final generated tokens align perfectly on the right edge of the matrix. For fine-tuning tasks, however, your data pipeline should be configured to use right-padding.
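The two directions can be contrasted on a single sequence (toy token IDs, assumed padding ID of 0):

```python
# Right-padding (training) vs left-padding (batched inference).
PAD_ID = 0
seq = [11, 22, 33]
target_len = 6

right_padded = seq + [PAD_ID] * (target_len - len(seq))  # pads at the end
left_padded = [PAD_ID] * (target_len - len(seq)) + seq   # pads at the start

print(right_padded)  # [11, 22, 33, 0, 0, 0]
print(left_padded)   # [0, 0, 0, 11, 22, 33]
```

With left-padding, the real tokens of every sequence end at the same column, so each newly generated token lands in the same position across the whole batch.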