Executing a training loop allows a model to learn from an instruction dataset. Controlling this process requires defining training arguments. The TrainingArguments class in the Hugging Face Transformers library acts as the central control panel for the training loop, dictating how the model processes data, updates its weights, and utilizes hardware resources.
The most immediate constraint when fine-tuning a small language model is GPU memory (VRAM). The per_device_train_batch_size argument determines how many examples the model processes simultaneously in a single forward and backward pass. If this value is set too high, you will encounter Out of Memory (OOM) errors. If set too low, gradient estimates become noisy, training takes significantly longer, and optimization can become unstable.
To resolve this tension, we use gradient accumulation. The gradient_accumulation_steps parameter allows you to simulate a larger batch size by accumulating gradients over multiple smaller batches before performing a weight update.
The effective batch size is calculated using the following formula:

B_effective = B_micro × N_accum × N_gpus

Here, B_effective represents the effective batch size, B_micro is the per-device batch size, N_accum is the number of accumulation steps, and N_gpus is the total number of active GPUs. If your hardware can only handle a micro-batch size of 2, but you want an effective batch size of 16 for stable learning on a single GPU, you would set your accumulation steps to 8.
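As a sanity check on this arithmetic, the following sketch (plain Python with a toy mean-squared-error gradient; all names and values are illustrative) shows that averaging gradients over 8 micro-batches of size 2 produces the same update as one full batch of 16:

```python
# Toy model y = w * x with MSE loss, so dL/dw = 2 * x * (w * x - y).
def grad(w, batch):
    """Average gradient of the MSE loss over a batch of (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
data = [(float(i), 3.0 * i) for i in range(1, 17)]  # 16 examples, true w = 3

# Full batch of 16 in a single pass.
full_grad = grad(w, data)

# Gradient accumulation: 8 micro-batches of 2, averaged before one update.
micro_batches = [data[i:i + 2] for i in range(0, 16, 2)]
accum_grad = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)

per_device, accum_steps, num_gpus = 2, 8, 1
effective = per_device * accum_steps * num_gpus
print(effective)                            # 16
print(abs(full_grad - accum_grad) < 1e-9)   # True: same update either way
```

Because every micro-batch has the same size, the average of the micro-batch gradients equals the full-batch gradient exactly, which is why accumulation is a faithful stand-in for a larger batch.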
Figure: gradient accumulation process aggregating backward passes from multiple micro-batches before a single weight update.
The length of your training run is governed by either num_train_epochs or max_steps. An epoch represents one complete pass through the entire training dataset. For small language models adapting to a specific task, 1 to 3 epochs is often sufficient. Training for too many epochs risks memorizing the training data instead of learning general patterns. Alternatively, you can cap the training using max_steps, which overrides the epoch count and stops training after a specific number of optimization steps.
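To see how epochs translate into optimization steps, here is a small back-of-the-envelope helper (plain Python; the dataset size of 10,000 is an assumed example, not from the text):

```python
import math

def total_optimizer_steps(dataset_size, per_device_batch, accum_steps, num_gpus, epochs):
    """Number of weight updates: one update per effective batch, per epoch."""
    effective_batch = per_device_batch * accum_steps * num_gpus
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    return steps_per_epoch * epochs

# Example: 10,000 training examples with the configuration used later in this section.
steps = total_optimizer_steps(10_000, per_device_batch=2, accum_steps=8, num_gpus=1, epochs=3)
print(steps)  # 1875: setting max_steps below this would end training before 3 full epochs
```

This makes the override relationship concrete: max_steps=1000 with this configuration would stop training partway through the second epoch.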
Selecting the right optimizer is essential for memory management during Parameter-Efficient Fine-Tuning. The optim argument is frequently set to paged_adamw_32bit or paged_adamw_8bit. Paged optimizers utilize the unified memory features of modern GPUs to move optimizer states to CPU RAM when VRAM is near capacity. This paging mechanism prevents crashes during sudden memory spikes.
Additionally, mixed precision training drastically reduces memory consumption and speeds up computation. You can enable this by setting fp16=True or bf16=True. If your hardware supports it, bf16 (Bfloat16) is highly recommended. Bfloat16 maintains the same dynamic range as standard 32-bit floating-point numbers. This prevents the numerical underflow and overflow issues sometimes encountered with standard 16-bit floats during the gradient calculation of language models.
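The range difference is easy to demonstrate with Python's standard library: struct supports IEEE half precision (format "e"), and a bfloat16 value can be emulated by keeping only the top 16 bits of a float32 encoding. This sketch (pure stdlib, illustrative only) shows a value that overflows float16 but survives a bfloat16 round trip:

```python
import struct

def fits_in_fp16(value):
    """True if the value can be stored as IEEE float16 without overflowing."""
    try:
        struct.pack("e", value)
        return True
    except OverflowError:
        return False

def bf16_round_trip(value):
    """Emulate bfloat16: keep only the top 16 bits of the float32 encoding."""
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    truncated = bits32 & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]

big = 1e8  # far beyond float16's maximum of ~65504, trivial for bfloat16
print(fits_in_fp16(big))     # False: float16 overflows
print(bf16_round_trip(big))  # close to 1e8: bfloat16 shares float32's exponent range
```

The trade-off is precision: bfloat16 keeps only 8 mantissa bits, so it represents a huge range coarsely, which is acceptable for gradients but is why it is paired with 32-bit master weights in mixed precision training.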
Here is how these arguments are assembled in code:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./slm-finetuned-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    optim="paged_adamw_32bit",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    max_grad_norm=0.3,
)
In this configuration, learning_rate defines the initial step size for the optimizer. Setting this value correctly requires balance. A value of 2e-4 is a common default for Low-Rank Adaptation (LoRA); because only a small subset of parameters is updated, LoRA typically tolerates larger learning rates than full fine-tuning does.
The max_grad_norm parameter is a safeguard. It prevents exploding gradients by clipping them if their combined norm exceeds the specified threshold, which in this case is 0.3. Finally, logging_steps dictates how frequently the loss metrics are printed to the console, while save_strategy defines when the model checkpoints are written to disk. Saving checkpoints per epoch ensures you have fallback weights available if the model begins to overfit toward the end of the run.
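What max_grad_norm does can be sketched in a few lines of plain Python (the gradient values here are invented for illustration; in practice the Trainer applies this across all parameter tensors at once):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down uniformly if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

grads = [0.3, -0.4, 1.2]  # combined norm = 1.3, above the 0.3 threshold
clipped = clip_by_global_norm(grads, max_norm=0.3)
print(math.sqrt(sum(g * g for g in clipped)))  # 0.3: magnitude capped
```

Note that the scaling is uniform across all components, so clipping caps the size of the update while preserving its direction.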