Configuring the foundational training script is the first step in the fine-tuning process. This script acts as the skeleton for model training, integrating PyTorch for tensor operations, the Hugging Face Transformers library for model management, and Accelerate to handle device placement automatically.
Begin by creating a new Python file named train.py. The first block of code will import the specific modules needed to load the model, process the data, and manage the hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from accelerate import Accelerator
from torch.utils.data import DataLoader
from torch.optim import AdamW
The AutoModelForCausalLM class is designed for text generation tasks, where the model predicts the next token in a sequence. The Accelerator class manages the interaction between your PyTorch code and the underlying hardware, whether that is a CUDA GPU or a CPU.
Managing hardware device placement manually requires adding .to('cuda') or .to('cpu') to every tensor and model instance. This approach becomes difficult to maintain as your code grows. The Accelerate library simplifies this by automatically detecting your hardware setup and handling the distribution of tensors behind the scenes.
accelerator = Accelerator()
device = accelerator.device
By initializing the Accelerator object early in the script, you establish a unified controller for all subsequent memory allocations.
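Conceptually, the device resolution that Accelerator performs at construction time boils down to a preference-ordered fallback. The sketch below is a simplified stand-in (a hypothetical pick_device helper, not the library's actual internals) that illustrates the idea:

```python
def pick_device(cuda_available, mps_available):
    # Prefer an NVIDIA GPU, then Apple Silicon, then fall back to CPU.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# On a machine with no accelerator, this resolves to the CPU.
print(pick_device(cuda_available=False, mps_available=False))  # → cpu
```

Because the real Accelerator makes this decision for you, the rest of the script never needs to branch on hardware type.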
Next, you must load the pre-trained Small Language Model and its corresponding tokenizer into memory. For this exercise, assume you are working with a lightweight model with around 500 million parameters.
model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)
Notice the condition checking for a pad_token. Many base models are trained without a dedicated padding token. When fine-tuning with batched data, shorter sequences must be padded to match the length of the longest sequence in the batch. Reusing the end-of-sequence token (eos_token) is a common and effective workaround that prevents errors during matrix operations.
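To see why padding matters, consider what happens when two tokenized sequences of different lengths are stacked into one batch. The sketch below uses plain Python with made-up token ids rather than real tokenizer output: it pads the shorter sequence with a pad id and builds the matching attention mask, which is exactly what the tokenizer does internally.

```python
def pad_batch(sequences, pad_id):
    # Pad every sequence to the length of the longest one so the
    # batch forms a rectangular matrix, and record which positions
    # hold real tokens (1) versus padding (0).
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Two "tokenized" sequences of unequal length; 0 stands in for the pad id.
ids, mask = pad_batch([[5, 9, 12], [7, 3]], pad_id=0)
print(ids)   # [[5, 9, 12], [7, 3, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask ensures the model ignores the padded positions when computing attention scores.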
You previously learned how to format data into instruction-response pairs. Now, you will load that formatted data and prepare it for the training loop. The script needs a DataLoader to iterate through the dataset in fixed batch sizes.
dataset = load_dataset("json", data_files="formatted_data.jsonl", split="train")
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_dataset.set_format("torch")
batch_size = 4
dataloader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=True)
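At this stage the DataLoader's core job is simple: shuffle the example indices once per epoch and hand them out in fixed-size chunks. A stripped-down illustration in plain Python (operating on a list of indices, not the real DataLoader):

```python
import random

def iterate_batches(num_examples, batch_size, seed=0):
    # Shuffle indices once, then slice them into fixed-size chunks;
    # the final batch may be smaller when the dataset size is not a
    # multiple of the batch size.
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(num_examples=10, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Note the ragged final batch of 2: this is why the padding logic from the tokenization step is applied per batch rather than globally.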
The batch size directly impacts the total number of update steps required to complete an epoch. You can calculate the total training steps using the following equation:

T = E × ⌈N / B⌉

Here, T represents the total training steps, E is the number of epochs, N is the total number of examples in the dataset, and B is the batch size. A smaller batch size requires less VRAM but increases the total number of parameter updates per epoch.
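This relationship is easy to verify with a quick calculation. The helper below (a hypothetical function, not part of train.py) computes the step count for an assumed dataset of 1,000 examples at the batch size used in this script:

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    # Each epoch processes ceil(N / B) batches, and every batch
    # triggers exactly one optimizer update.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return num_epochs * steps_per_epoch

# Example: 1,000 formatted examples, batch size 4, 3 epochs.
print(total_training_steps(1000, 4, 3))  # → 750
```

Doubling the batch size to 8 would halve this to 375 updates, at the cost of roughly twice the activation memory per step.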
The optimizer determines how the model weights are updated based on the calculated gradients. The AdamW optimizer is the standard choice for training transformer architectures because it handles weight decay effectively.
learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)
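The "W" in AdamW stands for decoupled weight decay: the decay term is applied directly to the weight rather than folded into the gradient, which is the key difference from plain Adam with L2 regularization. The scalar sketch below shows one update step under assumed default hyperparameters; it is illustrative only, since PyTorch's AdamW operates on full parameter tensors.

```python
import math

def adamw_step(w, g, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Update biased first and second moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction for the early steps (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weight directly,
    # independent of the adaptive gradient step.
    w = w - lr * weight_decay * w
    # Adaptive gradient step.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# One step on a single weight with a positive gradient pushes it down.
w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w)
```

Because the decay is decoupled, the regularization strength does not get rescaled by the adaptive denominator, which is why AdamW handles weight decay more predictably than Adam.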
The final step in your setup script is to pass the model, optimizer, and dataloader to the accelerator's prepare method.
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
This method inspects your environment, wraps the PyTorch modules in specialized classes, and ensures that all data batches and model parameters are automatically moved to the GPU during training.
Diagram: Pipeline for initializing the training script components before executing the fine-tuning loop.
Your script is now structured to handle local training efficiently. The data is properly batched, the hardware is managed dynamically, and the model is loaded into memory. In the next phases of the fine-tuning process, you will modify this base model by injecting parameter-efficient adapters to reduce memory consumption further.