Executing a training loop allows a model to learn from an instruction dataset. Controlling this process requires defining training arguments. The TrainingArguments class in the Hugging Face Transformers library acts as the central control panel for the training loop, dictating how the model processes data, updates its weights, and utilizes hardware resources.
The most immediate constraint when fine-tuning a small language model is GPU memory (VRAM). The per_device_train_batch_size argument determines how many examples the model processes simultaneously in a single forward and backward pass. If this value is set too high, you will encounter Out of Memory (OOM) errors. If set too low, gradient estimates become noisy, training takes significantly longer, and optimization can become unstable.
To resolve this tension, we use gradient accumulation. The gradient_accumulation_steps parameter allows you to simulate a larger batch size by accumulating gradients over multiple smaller batches before performing a weight update.
The effective batch size is calculated using the following formula:

B_effective = B_micro × N_accum × N_gpus

Here, B_effective represents the effective batch size, B_micro is the per-device batch size, N_accum is the number of accumulation steps, and N_gpus is the total number of active GPUs. If your hardware can only handle a micro-batch size of 2, but you want an effective batch size of 16 for stable learning on a single GPU, you would set your accumulation steps to 8.
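As a sanity check on this arithmetic, the following sketch (plain Python with a toy mean-squared-error gradient; all names and values are illustrative) shows that averaging gradients over 8 micro-batches of size 2 produces the same update as one full batch of 16:

```python
# Toy model y = w * x with MSE loss, so dL/dw = 2 * x * (w * x - y).
def grad(w, batch):
    """Average gradient of the MSE loss over a batch of (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w = 0.5
data = [(float(i), 3.0 * i) for i in range(1, 17)]  # 16 examples, true w = 3

# Full batch of 16 in a single pass.
full_grad = grad(w, data)

# Gradient accumulation: 8 micro-batches of 2, averaged before one update.
micro_batches = [data[i:i + 2] for i in range(0, 16, 2)]
accum_grad = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)

per_device, accum_steps, num_gpus = 2, 8, 1
effective = per_device * accum_steps * num_gpus
print(effective)                            # 16
print(abs(full_grad - accum_grad) < 1e-9)   # True: same update either way
```

Because every micro-batch has the same size, the average of the micro-batch gradients equals the full-batch gradient exactly, which is why accumulation is a faithful stand-in for a larger batch.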
Figure: gradient accumulation process aggregating backward passes from multiple micro-batches before a single weight update.
The length of your training run is governed by either num_train_epochs or max_steps. An epoch represents one complete pass through the entire training dataset. For small language models adapting to a specific task, 1 to 3 epochs is often sufficient. Training for too many epochs risks memorizing the training data instead of learning general patterns. Alternatively, you can cap the training using max_steps, which overrides the epoch count and stops training after a specific number of optimization steps.
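To see how epochs translate into optimization steps, here is a small back-of-the-envelope helper (plain Python; the dataset size of 10,000 is an assumed example, not from the text):

```python
import math

def total_optimizer_steps(dataset_size, per_device_batch, accum_steps, num_gpus, epochs):
    """Number of weight updates: one update per effective batch, per epoch."""
    effective_batch = per_device_batch * accum_steps * num_gpus
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    return steps_per_epoch * epochs

# Example: 10,000 training examples with the configuration used later in this section.
steps = total_optimizer_steps(10_000, per_device_batch=2, accum_steps=8, num_gpus=1, epochs=3)
print(steps)  # 1875: setting max_steps below this would end training before 3 full epochs
```

This makes the override relationship concrete: max_steps=1000 with this configuration would stop training partway through the second epoch.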
Selecting the right optimizer is essential for memory management during Parameter-Efficient Fine-Tuning. The optim argument is frequently set to paged_adamw_32bit or paged_adamw_8bit. Paged optimizers utilize the unified memory features of modern GPUs to move optimizer states to CPU RAM when VRAM is near capacity. This paging mechanism prevents crashes during sudden memory spikes.
Additionally, mixed precision training drastically reduces memory consumption and speeds up computation. You can enable this by setting fp16=True or bf16=True. If your hardware supports it, bf16 (Bfloat16) is highly recommended. Bfloat16 maintains the same dynamic range as standard 32-bit floating-point numbers. This prevents the numerical underflow and overflow issues sometimes encountered with standard 16-bit floats during the gradient calculation of language models.
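The range difference is easy to demonstrate with Python's standard library: struct supports IEEE half precision (format "e"), and a bfloat16 value can be emulated by keeping only the top 16 bits of a float32 encoding. This sketch (pure stdlib, illustrative only) shows a value that overflows float16 but survives a bfloat16 round trip:

```python
import struct

def fits_in_fp16(value):
    """True if the value can be stored as IEEE float16 without overflowing."""
    try:
        struct.pack("e", value)
        return True
    except OverflowError:
        return False

def bf16_round_trip(value):
    """Emulate bfloat16: keep only the top 16 bits of the float32 encoding."""
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    truncated = bits32 & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]

big = 1e8  # far beyond float16's maximum of ~65504, trivial for bfloat16
print(fits_in_fp16(big))     # False: float16 overflows
print(bf16_round_trip(big))  # close to 1e8: bfloat16 shares float32's exponent range
```

The trade-off is precision: bfloat16 keeps only 8 mantissa bits, so it represents a huge range coarsely, which is acceptable for gradients but is why it is paired with 32-bit master weights in mixed precision training.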
Here is how these arguments are assembled in code:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./slm-finetuned-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    optim="paged_adamw_32bit",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    max_grad_norm=0.3,
)
In this configuration, learning_rate defines the initial step size for the optimizer. Setting this value correctly requires balance. A value of 2e-4 is a common default for Low-Rank Adaptation (LoRA); because only a small subset of parameters is updated, LoRA typically tolerates larger learning rates than full fine-tuning does.
The max_grad_norm parameter is a safeguard. It prevents exploding gradients by clipping them if their combined norm exceeds the specified threshold, which in this case is 0.3. Finally, logging_steps dictates how frequently the loss metrics are printed to the console, while save_strategy defines when the model checkpoints are written to disk. Saving checkpoints per epoch ensures you have fallback weights available if the model begins to overfit toward the end of the run.
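What max_grad_norm does can be sketched in a few lines of plain Python (the gradient values here are invented for illustration; in practice the Trainer applies this across all parameter tensors at once):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down uniformly if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

grads = [0.3, -0.4, 1.2]  # combined norm = 1.3, above the 0.3 threshold
clipped = clip_by_global_norm(grads, max_norm=0.3)
print(math.sqrt(sum(g * g for g in clipped)))  # 0.3: magnitude capped
```

Note that the scaling is uniform across all components, so clipping caps the size of the update while preserving its direction.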