Before a model can process text, the text must be converted into a numerical representation. This conversion, known as tokenization, typically follows data structuring and cleaning. Tokenization is more than simply splitting sentences into words; it is a precise conversion that must match the model's original training exactly, so that the model correctly interprets the grammar and semantics of the prepared data.
A foundational rule in fine-tuning is that you must use the exact same tokenizer the base model was pre-trained with. Every model on platforms like the Hugging Face Hub is bundled with its corresponding tokenizer configuration. Using a different tokenizer, for instance, applying a BERT tokenizer to a Llama model, will result in a vocabulary mismatch and lead to poor model performance, as the model will misinterpret the input token IDs.
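To make the mismatch concrete, here is a small illustrative check (assuming the bert-base-uncased and gpt2 checkpoints are accessible) that tokenizes the same sentence with two unrelated tokenizers. The resulting IDs come from entirely different vocabularies, so feeding one model the other's IDs would produce meaningless input.
from transformers import AutoTokenizer

# Tokenize the same sentence with two unrelated tokenizers.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Tokenizers are not interchangeable."
print("BERT IDs: ", bert_tokenizer(sentence)["input_ids"])
print("GPT-2 IDs:", gpt2_tokenizer(sentence)["input_ids"])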
You can easily load the correct tokenizer for any given model using the transformers library.
from transformers import AutoTokenizer
# Load the tokenizer associated with a specific model
# Replace "meta-llama/Llama-2-7b-hf" with your chosen base model
# Note: Some models may require authentication or access grants.
# For a universally accessible example, you could use "gpt2".
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
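As a quick sanity check, you can encode a short string and decode it back. The sample sentence below is arbitrary, and the exact IDs depend on the tokenizer's vocabulary.
# Encode a short string to token IDs, then decode back to text.
sample_text = "Fine-tuning starts with good data."
token_ids = tokenizer(sample_text)["input_ids"]
print("Token IDs:", token_ids)
print("Decoded: ", tokenizer.decode(token_ids))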
Pre-trained models rely on a set of special tokens to understand the structure of the input. These tokens are not part of natural language but serve as delimiters or signals. While their exact form varies between models, their functions are generally consistent:
<s> (start of sequence) and </s> (end of sequence) mark the beginning and end of a complete text input.
The [PAD] token is used to make all sequences in a batch the same length, a requirement for efficient processing on GPUs.
The [UNK] token is a placeholder for any word that is not in the model's vocabulary.
When fine-tuning, you often need to combine multiple text fields into a single formatted string, using these special tokens or other model-specific markers to delineate sections. For an instruction-following model, you might format your data with labels like ### Instruction: and ### Response:. This templating must be applied consistently across your entire dataset.
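As a sketch of that templating step, the helper below formats a single record with the ### Instruction: and ### Response: labels; the field names instruction and response are assumed and should match your dataset's schema.
def format_example(example):
    # Combine the fields of one record into a single templated string.
    # The 'instruction' and 'response' keys are assumed field names.
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

print(format_example({
    "instruction": "Translate 'hello' to French.",
    "response": "Bonjour."
}))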
Let's inspect the special tokens for the GPT-2 tokenizer. Note that some models, like GPT-2, may not have a default padding token, so we often assign one, typically the end-of-sequence token, for this purpose.
# GPT-2 does not have a default padding token
# We can set it to the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print("End of Sequence Token:", tokenizer.eos_token)
print("Padding Token:", tokenizer.pad_token)
The diagram below shows the flow from raw text to a model-ready, tokenized input.
The tokenization pipeline begins with raw data, formats it into a prompt, and uses the model's tokenizer to produce numerical tensors.
Models process data in batches for efficiency, which requires that every input sequence in a batch has the same length. If your data entries have varying lengths, you must apply two standard techniques during tokenization:
Padding: shorter sequences are extended with the [PAD] token until they match the length of the longest sequence in the batch.
Truncation: sequences that exceed a chosen maximum length are cut down to that length.
When you pad a sequence, you must also tell the model to ignore the padding tokens during the self-attention calculation. This is accomplished with an attention mask, a binary tensor of the same shape as the input IDs. A 1 indicates a real token that the model should attend to, while a 0 indicates a padding token it should ignore.
Fortunately, Hugging Face tokenizers handle all of this automatically. When you call the tokenizer on a list of texts, you can enable these features with simple arguments.
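For instance, the call below (using two placeholder sentences) pads and truncates a small batch in one step and returns the attention mask alongside the input IDs.
# Tokenize two sentences of different lengths in one batch.
batch = tokenizer(
    ["Short sentence.", "A noticeably longer sentence that needs more tokens."],
    padding=True,        # Pad shorter sequences to the longest in the batch.
    truncation=True,     # Truncate anything beyond the maximum length.
    return_tensors="pt"  # Return PyTorch tensors.
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])  # 1 = real token, 0 = padding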
An illustration of a tokenized batch. In the attention mask (right), blue (1) indicates real tokens and gray (0) indicates padding tokens that the model will ignore. Notice that the shorter first sequence is padded on the left.
For decoder-only models (like GPT and Llama families), it is standard practice to pad on the left (padding_side='left'). This ensures the original, non-padded tokens are positioned at the end of the sequence, which is important for the model's causal attention mechanism during generation.
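A minimal sketch of configuring this on the tokenizer we loaded earlier:
# Decoder-only models generate from left to right, so pad on the left
# to keep the real tokens adjacent to where generation continues.
tokenizer.padding_side = "left"
print(tokenizer.padding_side)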
Let's tie everything together with a Python function that prepares a batch of examples. This function first formats each example using a prompt template and then tokenizes the entire batch, applying padding and truncation.
from transformers import AutoTokenizer
# For this example, we use GPT-2 and set the padding token
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token
def create_prompt_and_tokenize(examples):
    """
    Takes a batch of examples, creates a formatted prompt for each,
    and tokenizes them.
    """
    # Use a list comprehension to format each example.
    # This assumes each 'example' is a dictionary with 'instruction' and 'response' keys.
    formatted_prompts = [
        f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"
        for ex in examples
    ]

    # Tokenize the batch of formatted prompts.
    tokenized_batch = tokenizer(
        formatted_prompts,
        padding="longest",    # Pad to the length of the longest sequence.
        truncation=True,      # Truncate sequences that are too long.
        max_length=256,       # Set a max length for consistency.
        return_tensors="pt"   # Return PyTorch tensors.
    )
    return tokenized_batch
# A dummy dataset to demonstrate the function
dummy_dataset = [
    {"instruction": "What is the capital of Italy?", "response": "The capital of Italy is Rome."},
    {"instruction": "Summarize this text.", "response": "This is a summary."}
]
# Apply the function to the dataset
tokenized_output = create_prompt_and_tokenize(dummy_dataset)
# Inspect the output
print(tokenized_output.keys())
# Expected output: dict_keys(['input_ids', 'attention_mask'])
print("\nShape of Input IDs:", tokenized_output['input_ids'].shape)
# Expected output: Shape of Input IDs: torch.Size([2, 21])
This function is the final building block of our data preparation pipeline. It can be applied to an entire dataset object (e.g., a Hugging Face Dataset) using the .map() method, efficiently preparing all your data for the training phase. With a properly tokenized dataset, you are now ready to begin the fine-tuning process.
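As a sketch of that last step, assuming the datasets library is installed: with batched=True, .map() passes a dictionary of column lists rather than a list of records, so the formatting loop zips the columns instead of iterating over dictionaries, and padding is left to a data collator at training time.
from datasets import Dataset

def tokenize_batch(batch):
    # batch is a dict of column lists, e.g. {"instruction": [...], "response": [...]}
    formatted_prompts = [
        f"Instruction: {instr}\nResponse: {resp}"
        for instr, resp in zip(batch["instruction"], batch["response"])
    ]
    # Padding is typically applied later by a data collator, so it is omitted here.
    return tokenizer(formatted_prompts, truncation=True, max_length=256)

dataset = Dataset.from_list(dummy_dataset)
tokenized_dataset = dataset.map(tokenize_batch, batched=True)
print(tokenized_dataset)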