Initiating the training process involves binding the model architecture, tokenized datasets, and defined hyperparameters into an automated sequence of forward passes, loss calculations, and weight updates. The Hugging Face Trainer class manages this process by abstracting away much of the boilerplate PyTorch code while maintaining fine-grained control over the execution state.
Before committing to a full training run that might take hours or days, it is standard practice to test the pipeline on a small subset of your data. This smoke test confirms that the run fits within your memory budget, avoiding out-of-memory errors, and that the loss decreases as expected.
# Select a small subset of examples for the practice run
small_train_dataset = tokenized_dataset["train"].select(range(500))
small_eval_dataset = tokenized_dataset["test"].select(range(100))
Running on a restricted dataset allows you to quickly validate your data collator and ensure the inputs are properly padded to the maximum length of each batch.
To execute the loop, instantiate the Trainer class. You must pass the base model with its attached LoRA adapters, the training arguments defined in the previous section, the split datasets, and the data collator. The data collator prepares the raw tokenized sequences for the model by organizing them into uniform tensors.
from transformers import Trainer, DataCollatorForLanguageModeling
# Create a data collator for causal language modeling; it copies the
# inputs into labels and masks padding tokens so the loss ignores them
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False
)
trainer = Trainer(
model=peft_model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
data_collator=data_collator,
)
By setting mlm=False, you instruct the collator to process the data for causal language modeling, which is the standard approach for generative tasks where the model predicts the next token in a sequence.
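To make the collator's labeling rule concrete, the sketch below reproduces its core behavior in plain Python. The pad token id of 0 is an assumption for illustration; real tokenizers define their own: every token becomes its own label, except padding positions, which are set to -100 so the cross-entropy loss skips them.

```python
PAD_ID = 0  # hypothetical pad token id, for illustration only

def causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Mirror the mlm=False collator rule: labels are a copy of the
    inputs, with padding positions replaced by -100 so they are
    ignored by the loss function."""
    return [tok if tok != pad_id else -100 for tok in input_ids]

# A padded sequence: three real tokens followed by two pad tokens
batch_row = [101, 2054, 2003, 0, 0]
print(causal_lm_labels(batch_row))  # [101, 2054, 2003, -100, -100]
```

During the forward pass, the model internally shifts these labels by one position, so each token is trained to predict its successor.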
Initiate the process by calling the train method on the instantiated object.
# Start the training loop
training_results = trainer.train()
When you execute this command, the automated loop begins. The engine fetches batches of data, passes them through the model, and computes the cross-entropy loss between the model's predicted token distributions and the actual target tokens.
The sequence of operations executed during a single training iteration.
Because you are using Parameter-Efficient Fine-Tuning, the backward pass only computes gradients for the injected LoRA matrices. The frozen base model weights remain untouched, saving massive amounts of computational resources.
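Those savings are easy to quantify. For a single square weight matrix of size d × d, a rank-r LoRA adapter adds only 2·r·d trainable parameters. The numbers below (hidden size 4096, rank 8) are illustrative assumptions, not values taken from this chapter's configuration:

```python
d = 4096  # hidden dimension of one projection matrix (assumed)
r = 8     # LoRA rank (assumed)

base_params = d * d       # frozen weights in the original matrix
lora_params = 2 * r * d   # trainable weights in the A and B matrices

print(base_params)                # 16777216
print(lora_params)                # 65536
print(lora_params / base_params)  # ~0.0039, under 0.4% of the original
```

Because gradients and optimizer state are only kept for that small fraction of weights, the memory footprint of the backward pass shrinks accordingly.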
The cross-entropy loss for a single sequence of length $T$ is calculated as:

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$

Here, $P(x_t \mid x_{<t})$ is the probability the model assigns to the correct next token $x_t$ given all previous context tokens. As the training loop runs, the optimizer adjusts the adapter weights to increase this probability, causing the overall loss value to decrease.
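As a worked example, suppose the model assigns probabilities of 0.5, 0.25, and 0.8 to the three correct next tokens in a short sequence (made-up numbers for illustration). The loss is the average negative log probability:

```python
import math

# Hypothetical probabilities the model assigns to each correct next token
probs = [0.5, 0.25, 0.8]

# Average negative log-likelihood across the sequence
loss = -sum(math.log(p) for p in probs) / len(probs)
print(round(loss, 4))  # 0.7675
```

If every probability were 1.0 the loss would be exactly 0, so lower loss directly reflects more confident, correct next-token predictions.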
As the loop iterates, you will see output logs in your terminal at the intervals you specified in your training arguments. You should monitor two primary metrics: the training loss and the evaluation loss.
Training loss compared to evaluation loss over 300 steps. The divergence at the end indicates early signs of overfitting.
A successful run exhibits a steady decline in both training and evaluation loss. If the training loss continues to decrease but the evaluation loss begins to climb, your model has started to memorize the training subset. This means it is losing its ability to generalize to new data. If you observe this behavior, you can stop the training loop early or adjust your learning rate and weight decay parameters.
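A simple way to automate that decision is to stop once the evaluation loss has failed to improve for a few consecutive evaluations. The standalone helper below sketches that patience logic; the Trainer offers comparable behavior through its EarlyStoppingCallback, but this function is an illustrative stand-in, not the library API.

```python
def should_stop(eval_losses, patience=3):
    """Return True if the most recent `patience` evaluation losses
    all failed to improve on the best loss seen before them."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before for loss in eval_losses[-patience:])

# Eval loss bottoms out at 1.55, then climbs for three straight checks
history = [2.1, 1.8, 1.6, 1.55, 1.58, 1.61, 1.64]
print(should_stop(history))  # True
```

Calling a check like this after each evaluation step lets you halt the run before the model drifts further into memorizing the training subset.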
Once the loop successfully processes the predefined number of epochs, you must save the newly trained adapter weights to disk.
# Save the trained LoRA adapters
trainer.save_model("./custom-slm-lora-adapters")
It is important to remember that because you used LoRA, you are not saving the entire multi-gigabyte language model. You are only saving the newly trained low-rank matrices, which typically take up just a few megabytes of storage. In the upcoming deployment phase, these lightweight adapter files will be loaded on top of the original base model to alter its text generation behavior.