Before a model can process text, the text must be converted into a numerical representation. This conversion, known as tokenization, typically follows data structuring and cleaning. Tokenization is more than simply splitting sentences into words; it is a precise conversion that must match the model's original training exactly, so that the model correctly interprets the grammar and semantics of the prepared data.
A foundational rule in fine-tuning is that you must use the exact same tokenizer the base model was pre-trained with. Every model on platforms like the Hugging Face Hub is bundled with its corresponding tokenizer configuration. Using a different tokenizer, for instance, applying a BERT tokenizer to a Llama model, will result in a vocabulary mismatch and lead to poor model performance, as the model will misinterpret the input token IDs.
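To make the mismatch concrete, here is a small illustrative check (assuming the bert-base-uncased and gpt2 checkpoints are accessible) that tokenizes the same sentence with two unrelated tokenizers. The resulting IDs come from entirely different vocabularies, so feeding one model the other's IDs would produce meaningless input.
from transformers import AutoTokenizer

# Tokenize the same sentence with two unrelated tokenizers.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Tokenizers are not interchangeable."
print("BERT IDs: ", bert_tokenizer(sentence)["input_ids"])
print("GPT-2 IDs:", gpt2_tokenizer(sentence)["input_ids"])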
You can easily load the correct tokenizer for any given model using the transformers library.
from transformers import AutoTokenizer
# Load the tokenizer associated with a specific model
# Replace "meta-llama/Llama-2-7b-hf" with your chosen base model
# Note: Some models may require authentication or access grants.
# For a universally accessible example, you could use "gpt2".
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
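As a quick sanity check, you can encode a short string and decode it back. The sample sentence below is arbitrary, and the exact IDs depend on the tokenizer's vocabulary.
# Encode a short string to token IDs, then decode back to text.
sample_text = "Fine-tuning starts with good data."
token_ids = tokenizer(sample_text)["input_ids"]
print("Token IDs:", token_ids)
print("Decoded: ", tokenizer.decode(token_ids))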
Pre-trained models rely on a set of special tokens to understand the structure of the input. These tokens are not part of natural language but serve as delimiters or signals. While their exact form varies between models, their functions are generally consistent:
<s> (start of sequence) and </s> (end of sequence) mark the beginning and end of a complete text input.
The [PAD] token is used to make all sequences in a batch the same length, a requirement for efficient processing on GPUs.
The [UNK] token is a placeholder for any word that is not in the model's vocabulary.
When fine-tuning, you often need to combine multiple text fields into a single formatted string, using these special tokens or other model-specific markers to delineate sections. For an instruction-following model, you might format your data with labels like ### Instruction: and ### Response:. This templating must be applied consistently across your entire dataset.
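As a sketch of that templating step, the helper below formats a single record with the ### Instruction: and ### Response: labels; the field names instruction and response are assumed and should match your dataset's schema.
def format_example(example):
    # Combine the fields of one record into a single templated string.
    # The 'instruction' and 'response' keys are assumed field names.
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

print(format_example({
    "instruction": "Translate 'hello' to French.",
    "response": "Bonjour."
}))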
Let's inspect the special tokens for the GPT-2 tokenizer. Note that some models, like GPT-2, may not have a default padding token, so we often assign one, typically the end-of-sequence token, for this purpose.
# GPT-2 does not have a default padding token
# We can set it to the end-of-sequence token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print("End of Sequence Token:", tokenizer.eos_token)
print("Padding Token:", tokenizer.pad_token)
The diagram below shows the flow from raw text to a model-ready, tokenized input.
The tokenization pipeline begins with raw data, formats it into a prompt, and uses the model's tokenizer to produce numerical tensors.
Models process data in batches for efficiency, which requires that every input sequence in a batch has the same length. If your data entries have varying lengths, you must apply two standard techniques during tokenization:
Padding: shorter sequences are extended with the [PAD] token until they match the length of the longest sequence in the batch.
Truncation: sequences that exceed a chosen maximum length are cut down to that length.
When you pad a sequence, you must also tell the model to ignore the padding tokens during the self-attention calculation. This is accomplished with an attention mask, a binary tensor of the same shape as the input IDs. A 1 indicates a real token that the model should attend to, while a 0 indicates a padding token it should ignore.
Fortunately, Hugging Face tokenizers handle all of this automatically. When you call the tokenizer on a list of texts, you can enable these features with simple arguments.
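For instance, the call below (using two placeholder sentences) pads and truncates a small batch in one step and returns the attention mask alongside the input IDs.
# Tokenize two sentences of different lengths in one batch.
batch = tokenizer(
    ["Short sentence.", "A noticeably longer sentence that needs more tokens."],
    padding=True,        # Pad shorter sequences to the longest in the batch.
    truncation=True,     # Truncate anything beyond the maximum length.
    return_tensors="pt"  # Return PyTorch tensors.
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])  # 1 = real token, 0 = padding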
An illustration of a tokenized batch. In the attention mask (right), blue (1) indicates real tokens and gray (0) indicates padding tokens that the model will ignore. Notice that the shorter first sequence is padded on the left.
For decoder-only models (like GPT and Llama families), it is standard practice to pad on the left (padding_side='left'). This ensures the original, non-padded tokens are positioned at the end of the sequence, which is important for the model's causal attention mechanism during generation.
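A minimal sketch of configuring this on the tokenizer we loaded earlier:
# Decoder-only models generate from left to right, so pad on the left
# to keep the real tokens adjacent to where generation continues.
tokenizer.padding_side = "left"
print(tokenizer.padding_side)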
Let's tie everything together with a Python function that prepares a batch of examples. This function first formats each example using a prompt template and then tokenizes the entire batch, applying padding and truncation.
from transformers import AutoTokenizer
# For this example, we use GPT-2 and set the padding token
model_checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token
def create_prompt_and_tokenize(examples):
    """
    Takes a batch of examples, creates a formatted prompt for each,
    and tokenizes them.
    """
    # Use a list comprehension to format each example.
    # This assumes each 'example' is a dictionary with 'instruction' and 'response' keys.
    formatted_prompts = [
        f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"
        for ex in examples
    ]

    # Tokenize the batch of formatted prompts.
    tokenized_batch = tokenizer(
        formatted_prompts,
        padding="longest",    # Pad to the length of the longest sequence.
        truncation=True,      # Truncate sequences that are too long.
        max_length=256,       # Set a max length for consistency.
        return_tensors="pt"   # Return PyTorch tensors.
    )
    return tokenized_batch
# A dummy dataset to demonstrate the function
dummy_dataset = [
    {"instruction": "What is the capital of Italy?", "response": "The capital of Italy is Rome."},
    {"instruction": "Summarize this text.", "response": "This is a summary."}
]
# Apply the function to the dataset
tokenized_output = create_prompt_and_tokenize(dummy_dataset)
# Inspect the output
print(tokenized_output.keys())
# Expected output: dict_keys(['input_ids', 'attention_mask'])
print("\nShape of Input IDs:", tokenized_output['input_ids'].shape)
# Expected output: Shape of Input IDs: torch.Size([2, 21])
This function is the final building block of our data preparation pipeline. It can be applied to an entire dataset object (e.g., a Hugging Face Dataset) using the .map() method, efficiently preparing all your data for the training phase. With a properly tokenized dataset, you are now ready to begin the fine-tuning process.
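As a sketch of that last step, assuming the datasets library is installed: with batched=True, .map() passes a dictionary of column lists rather than a list of records, so the formatting loop zips the columns instead of iterating over dictionaries, and padding is left to a data collator at training time.
from datasets import Dataset

def tokenize_batch(batch):
    # batch is a dict of column lists, e.g. {"instruction": [...], "response": [...]}
    formatted_prompts = [
        f"Instruction: {instr}\nResponse: {resp}"
        for instr, resp in zip(batch["instruction"], batch["response"])
    ]
    # Padding is typically applied later by a data collator, so it is omitted here.
    return tokenizer(formatted_prompts, truncation=True, max_length=256)

dataset = Dataset.from_list(dummy_dataset)
tokenized_dataset = dataset.map(tokenize_batch, batched=True)
print(tokenized_dataset)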