Text models process arrays of numbers rather than characters. Training a Small Language Model requires converting human-readable instruction datasets into sequences of integers. This translation is handled by a tokenizer. Most modern language models use subword tokenization algorithms such as Byte-Pair Encoding or SentencePiece. Instead of mapping entire words to single integers or mapping individual characters, subword algorithms strike a balance between the two. They split rare words into smaller, more frequent chunks and assign a specific integer ID to each chunk.
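The idea can be sketched with a toy greedy longest-match splitter over a hypothetical subword vocabulary (this is an illustration of the concept, not a real BPE implementation; the vocabulary and IDs below are made up):

```python
# Hypothetical subword vocabulary mapping chunks to integer IDs.
vocab = {"token": 1, "ization": 2, "un": 3, "break": 4, "able": 5}

def subword_split(word, vocab):
    """Greedily split a word into the longest known subword chunks."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no subword covers position {i} in {word!r}")
    return pieces

print(subword_split("tokenization", vocab))                 # ['token', 'ization']
print([vocab[p] for p in subword_split("unbreakable", vocab)])  # [3, 4, 5]
```

A rare word like "unbreakable" never needs its own ID; it is encoded as three frequent chunks, which keeps the vocabulary small while still covering unseen words.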
When you format a dataset for training, you do not process one sequence at a time. Data is processed in batches to optimize GPU matrix operations. A neural network requires a perfectly rectangular tensor of shape (batch_size, sequence_length) to perform these operations. Natural language sequences, however, naturally vary in length.
To achieve uniform matrix dimensions across a batch, you apply padding. If you have a batch with sequences of varying lengths, you identify the maximum length in that specific batch. Let this maximum length be L_max. For any sequence in the batch with a length L where L < L_max, you append a dedicated padding token ID to the sequence. The number of padding tokens added to each sequence is calculated as follows:

padding_count = L_max - L

By appending padding tokens, every sequence in the batch reaches the exact length of L_max, resulting in a valid rectangular tensor.
Process of converting raw text sequences of varying lengths into a uniform padded tensor matrix for batch processing.
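A minimal sketch of this padding step in plain Python, with made-up token IDs and an assumed padding ID of 0:

```python
# Pad every sequence in the batch to the length of the longest one.
PAD_ID = 0  # assumed padding token ID for this illustration

batch = [[12, 7, 99], [5], [8, 3, 41, 17, 2]]
max_len = max(len(seq) for seq in batch)  # L_max = 5 for this batch

padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)  # every row now has length 5
```

The result is a 3 x 5 rectangle of integers, exactly the shape a (batch_size, sequence_length) tensor requires.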
While padding ensures shorter sequences meet the required matrix dimensions, you must also manage sequences that exceed the model's maximum context window. If a sequence is longer than the absolute maximum length the architecture can process, it will cause an out-of-memory error or a tensor shape mismatch.
Truncation solves this by cutting off the sequence at the maximum allowed length. In supervised fine-tuning, you typically truncate the end of the sequence. However, you must be careful with instruction datasets. If you truncate the response portion of an instruction-response pair, the model will learn to generate incomplete answers. A common strategy is to filter out excessively long sequences during the initial data preparation phase to avoid aggressive truncation during tokenization.
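The filtering strategy can be sketched as a simple pre-tokenization pass. The token counts below are stand-in values; in practice you would measure length with the tokenizer itself:

```python
# Drop examples whose tokenized length exceeds the context window,
# rather than truncating mid-response during training.
MAX_LENGTH = 256

examples = [
    {"text": "Translate this sentence.", "length": 12},      # stand-in length
    {"text": "A very long transcript ...", "length": 900},   # too long
    {"text": "Summarize the paragraph.", "length": 240},
]

kept = [ex for ex in examples if ex["length"] <= MAX_LENGTH]
print(len(kept))  # 2: the 900-token example is filtered out
```

Filtering up front means truncation only ever fires on borderline cases, so the model rarely, if ever, sees an instruction whose answer was cut off.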
Using the Hugging Face transformers library, you can handle tokenization, padding, and truncation in a single function call. Many Small Language Models do not define a dedicated padding token by default. In these scenarios, it is standard practice to assign the End of Sequence (EOS) token as the padding token.
from transformers import AutoTokenizer

# Initialize the tokenizer from a pre-trained SLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")

# Assign the EOS token as the padding token if one is not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Sample instruction dataset entries
texts = [
    "Translate the following sentence to French.",
    "Say hi.",
    "Write a python script to calculate the fibonacci sequence."
]

# Apply tokenization, padding, and truncation
encoded_inputs = tokenizer(
    texts,
    padding="longest",
    truncation=True,
    max_length=256,
    return_tensors="pt"
)

print(encoded_inputs["input_ids"])
In the implementation above, the padding="longest" argument instructs the tokenizer to pad the shorter sequences to match the length of the longest sequence in the current batch. This is highly memory efficient compared to padding="max_length", which would force every sequence to length 256 regardless of the actual text lengths in the batch. The return_tensors="pt" argument ensures the output is formatted as a PyTorch tensor, ready to be moved to a GPU.
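The memory difference is easy to see with a pure-Python stand-in for the tokenizer's two strategies (token IDs and lengths here are made up for illustration):

```python
# Compare padding="longest" vs padding="max_length" on token-ID lists.
MAX_LENGTH = 256
PAD_ID = 0

batch = [[1] * 9, [1] * 3, [1] * 14]  # stand-in tokenized sequences

# "longest": pad only up to the longest sequence in this batch (14).
longest = max(len(seq) for seq in batch)
padded_longest = [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]

# "max_length": pad everything up to the fixed limit (256).
padded_max = [seq + [PAD_ID] * (MAX_LENGTH - len(seq)) for seq in batch]

print(len(padded_longest[0]), len(padded_max[0]))  # 14 vs 256
```

For this batch, "longest" produces a 3 x 14 tensor instead of 3 x 256, so over 94% of the padding work (and the memory it occupies) is avoided.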
The direction in which you apply padding matters depending on the task. For training autoregressive causal language models, you apply right-padding. This means the padding tokens are appended to the end of the sequence. During the forward pass, the model predicts the next token from left to right, making right-padding the logical choice for maintaining sequence integrity.
Left-padding, where tokens are added to the beginning of the sequence, is occasionally used during batched inference. It ensures the final generated tokens align perfectly on the right edge of the matrix. For fine-tuning tasks, however, your data pipeline should be configured to use right-padding.
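The two directions can be contrasted on a single sequence (toy token IDs, assumed padding ID of 0):

```python
# Right-padding (training) vs left-padding (batched inference).
PAD_ID = 0
seq = [11, 22, 33]
target_len = 6

right_padded = seq + [PAD_ID] * (target_len - len(seq))  # pads at the end
left_padded = [PAD_ID] * (target_len - len(seq)) + seq   # pads at the start

print(right_padded)  # [11, 22, 33, 0, 0, 0]
print(left_padded)   # [0, 0, 0, 11, 22, 33]
```

With left-padding, the real tokens of every sequence end at the same column, so each newly generated token lands in the same position across the whole batch.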