Configuring the foundational training script is the first step in the fine-tuning process. This script acts as the skeleton for model training, integrating PyTorch for tensor operations, the Hugging Face Transformers library for model management, and Accelerate to handle device placement automatically.
Begin by creating a new Python file named train.py. The first block of code will import the specific modules needed to load the model, process the data, and manage the hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from accelerate import Accelerator
from torch.utils.data import DataLoader
from torch.optim import AdamW
The AutoModelForCausalLM class is designed for text generation tasks, where the model predicts the next token in a sequence. The Accelerator class manages the interaction between your PyTorch code and the underlying hardware, whether that is a CUDA GPU or a CPU.
Managing hardware device placement manually requires adding .to('cuda') or .to('cpu') to every tensor and model instance. This approach becomes difficult to maintain as your code grows. The Accelerate library simplifies this by automatically detecting your hardware setup and handling the distribution of tensors behind the scenes.
accelerator = Accelerator()
device = accelerator.device
By initializing the Accelerator object early in the script, you establish a unified controller for all subsequent memory allocations.
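Conceptually, the device resolution that Accelerator performs at construction time boils down to a preference-ordered fallback. The sketch below is a simplified stand-in (a hypothetical pick_device helper, not the library's actual internals) that illustrates the idea:

```python
def pick_device(cuda_available, mps_available):
    # Prefer an NVIDIA GPU, then Apple Silicon, then fall back to CPU.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# On a machine with no accelerator, this resolves to the CPU.
print(pick_device(cuda_available=False, mps_available=False))  # → cpu
```

Because the real Accelerator makes this decision for you, the rest of the script never needs to branch on hardware type.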
Next, you must load the pre-trained Small Language Model and its corresponding tokenizer into memory. For this exercise, assume you are working with a lightweight model with around 500 million parameters.
model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)
Notice the condition checking for a pad_token. Many base models are trained without a dedicated padding token. When fine-tuning with batched data, shorter sequences must be padded to match the length of the longest sequence in the batch. Reusing the end-of-sequence token (eos_token) is a common and effective workaround that prevents errors during matrix operations.
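To see why padding matters, consider what happens when two tokenized sequences of different lengths are stacked into one batch. The sketch below uses plain Python with made-up token ids rather than real tokenizer output: it pads the shorter sequence with a pad id and builds the matching attention mask, which is exactly what the tokenizer does internally.

```python
def pad_batch(sequences, pad_id):
    # Pad every sequence to the length of the longest one so the
    # batch forms a rectangular matrix, and record which positions
    # hold real tokens (1) versus padding (0).
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Two "tokenized" sequences of unequal length; 0 stands in for the pad id.
ids, mask = pad_batch([[5, 9, 12], [7, 3]], pad_id=0)
print(ids)   # [[5, 9, 12], [7, 3, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask ensures the model ignores the padded positions when computing attention scores.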
You previously learned how to format data into instruction-response pairs. Now, you will load that formatted data and prepare it for the training loop. The script needs a DataLoader to iterate through the dataset in fixed batch sizes.
dataset = load_dataset("json", data_files="formatted_data.jsonl", split="train")
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_dataset.set_format("torch")
batch_size = 4
dataloader = DataLoader(tokenized_dataset, batch_size=batch_size, shuffle=True)
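At this stage the DataLoader's core job is simple: shuffle the example indices once per epoch and hand them out in fixed-size chunks. A stripped-down illustration in plain Python (operating on a list of indices, not the real DataLoader):

```python
import random

def iterate_batches(num_examples, batch_size, seed=0):
    # Shuffle indices once, then slice them into fixed-size chunks;
    # the final batch may be smaller when the dataset size is not a
    # multiple of the batch size.
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(num_examples=10, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Note the ragged final batch of 2: this is why the padding logic from the tokenization step is applied per batch rather than globally.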
The batch size directly impacts the total number of update steps required to complete an epoch. You can calculate the total training steps using the following equation:

T = E × ⌈N / B⌉

Here, T represents the total training steps, E is the number of epochs, N is the total number of examples in the dataset, and B is the batch size. A smaller batch size requires less VRAM but increases the total number of parameter updates per epoch.
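This relationship is easy to verify with a quick calculation. The helper below (a hypothetical function, not part of train.py) computes the step count for an assumed dataset of 1,000 examples at the batch size used in this script:

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    # Each epoch processes ceil(N / B) batches, and every batch
    # triggers exactly one optimizer update.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return num_epochs * steps_per_epoch

# Example: 1,000 formatted examples, batch size 4, 3 epochs.
print(total_training_steps(1000, 4, 3))  # → 750
```

Doubling the batch size to 8 would halve this to 375 updates, at the cost of roughly twice the activation memory per step.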
The optimizer determines how the model weights are updated based on the calculated gradients. The AdamW optimizer is the standard choice for training transformer architectures because it handles weight decay effectively.
learning_rate = 5e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)
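The "W" in AdamW stands for decoupled weight decay: the decay term is applied directly to the weight rather than folded into the gradient, which is the key difference from plain Adam with L2 regularization. The scalar sketch below shows one update step under assumed default hyperparameters; it is illustrative only, since PyTorch's AdamW operates on full parameter tensors.

```python
import math

def adamw_step(w, g, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Update biased first and second moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction for the early steps (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weight directly,
    # independent of the adaptive gradient step.
    w = w - lr * weight_decay * w
    # Adaptive gradient step.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# One step on a single weight with a positive gradient pushes it down.
w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w)
```

Because the decay is decoupled, the regularization strength does not get rescaled by the adaptive denominator, which is why AdamW handles weight decay more predictably than Adam.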
The final step in your setup script is to pass the model, optimizer, and dataloader to the accelerator's prepare method.
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
This method inspects your environment, wraps the PyTorch modules in specialized classes, and ensures that all data batches and model parameters are automatically moved to the GPU during training.
Diagram: Pipeline for initializing the training script components before executing the fine-tuning loop.
Your script is now structured to handle local training efficiently. The data is properly batched, the hardware is managed dynamically, and the model is loaded into memory. In the next phases of the fine-tuning process, you will modify this base model by injecting parameter-efficient adapters to reduce memory consumption further.