Now that we've explored the theoretical underpinnings of full parameter fine-tuning, including its mechanism, hyperparameter considerations, regularization needs, and resource demands, it's time to translate that knowledge into practice. This hands-on exercise guides you through the process of fully fine-tuning a relatively small pre-trained language model for a specific downstream task. While "smaller" in the context of LLMs, this process still demands computational resources, reinforcing the concepts discussed regarding resource management.
We will use the Hugging Face transformers and datasets libraries, which provide convenient abstractions for many common NLP tasks and models. This example focuses on a text classification task, a common application for fine-tuning.
First, ensure you have the necessary libraries installed. You'll primarily need transformers, datasets, evaluate, torch (or tensorflow), and accelerate for efficient training.
pip install transformers datasets evaluate torch accelerate scikit-learn
We assume you are working in an environment with access to a GPU, as full fine-tuning, even for smaller models, can be very slow on a CPU. The accelerate library helps manage device placement (CPU/GPU) automatically.
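As a quick sanity check before starting a long run, you can confirm that PyTorch actually sees a GPU; a minimal sketch, assuming the PyTorch backend:
import torch

# Confirm that a GPU is visible to PyTorch before starting a long training run
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; training will fall back to CPU and be slow.")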
For this practical exercise, we'll use distilbert-base-uncased, a distilled version of BERT that retains most of BERT's performance while being smaller and faster. For the task, we'll use the imdb dataset, a standard benchmark for binary sentiment classification (positive/negative movie reviews).
# Define model and dataset names
model_checkpoint = "distilbert-base-uncased"
dataset_name = "imdb"
The datasets library makes loading standard datasets straightforward.
from datasets import load_dataset
# Load the dataset
raw_datasets = load_dataset(dataset_name)
# Display dataset structure (optional)
print(raw_datasets)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['text', 'label'],
#         num_rows: 50000
#     })
# })
Next, we need to tokenize the text data so the model can understand it. We use the tokenizer corresponding to our chosen pre-trained model.
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Define the tokenization function
def tokenize_function(examples):
    # Truncate sequences longer than the model's max input size and pad to max_length
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
# Apply tokenization to the entire dataset (batched for efficiency)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Remove the original 'text' column as it's no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
# Rename 'label' to 'labels' which is expected by the Trainer
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# Set the format to PyTorch tensors
tokenized_datasets.set_format("torch")
# Create smaller subsets for quicker demonstration (optional)
# Remove or adjust these lines for a full run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
print("\nSample tokenized training data:")
print(small_train_dataset[0])
This process converts the raw text into input IDs, attention masks, and includes the labels required for supervised learning.
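As a side note on the design choice above: padding every example to max_length is simple but wasteful. A common alternative is dynamic padding, where examples are tokenized without padding and padded per batch at collation time. A minimal sketch (tokenize_without_padding is just an illustrative name; you would pass the collator to the Trainer via its data_collator argument):
from transformers import DataCollatorWithPadding

# Pad each batch to the length of its longest sequence instead of a fixed max_length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenize_without_padding(examples):
    # No padding here; the collator handles it per batch
    return tokenizer(examples["text"], truncation=True, max_length=512)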
We load the pre-trained DistilBERT model configured for sequence classification. AutoModelForSequenceClassification automatically adds a classification head on top of the base DistilBERT model. We pass num_labels=2 because this is a binary classification task (positive/negative).
from transformers import AutoModelForSequenceClassification
# Load the model for sequence classification
# num_labels=2 for binary classification (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
At this point, model contains the pre-trained weights. During full fine-tuning, all of these weights, both the base model and the newly added classification head, will be updated.
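To confirm that every weight is trainable, you can count the parameters that require gradients; a small sketch (the figure of roughly 67 million is approximate):
# Count trainable parameters; for full fine-tuning this equals the total parameter count
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_trainable:,}")  # roughly 67M for DistilBERT plus the classification head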
The TrainingArguments class holds all the hyperparameters and settings for the training process. This includes parameters we discussed earlier, such as learning rate, batch size, number of epochs, and regularization (weight decay).
from transformers import TrainingArguments
# Define output directory for checkpoints and logs
output_dir = "./results/distilbert-imdb-full"
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",        # Evaluate performance at the end of each epoch
    save_strategy="epoch",              # Save a checkpoint at the end of each epoch
    num_train_epochs=3,                 # Number of training epochs (adjust as needed)
    per_device_train_batch_size=8,      # Training batch size per GPU
    per_device_eval_batch_size=8,       # Evaluation batch size per GPU
    learning_rate=2e-5,                 # Starting learning rate (a common value for fine-tuning)
    weight_decay=0.01,                  # Apply weight decay for regularization
    logging_dir='./logs',               # Directory for storing logs
    logging_steps=100,                  # Log training loss every 100 steps
    load_best_model_at_end=True,        # Load the best performing checkpoint at the end
    metric_for_best_model="accuracy",   # Metric used to determine the 'best' model
    # Use push_to_hub=True to upload results to the Hugging Face Hub (requires login)
    # push_to_hub=False,
)
These arguments directly control the fine-tuning process discussed theoretically in previous sections. Choosing appropriate values (like learning rate and epochs) often requires experimentation.
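As a rough sanity check on these settings, you can estimate how many optimizer steps the run will take; a minimal sketch using the 1,000-example subset defined earlier (single GPU, no gradient accumulation):
# Estimate optimizer steps: examples per epoch divided by batch size, times epochs
num_examples = len(small_train_dataset)
steps_per_epoch = (num_examples + training_args.per_device_train_batch_size - 1) // training_args.per_device_train_batch_size
total_steps = steps_per_epoch * int(training_args.num_train_epochs)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")  # 125 and 375 for the subset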
To monitor performance during training, we need a function that calculates metrics based on model predictions and true labels. We'll use standard accuracy for this classification task.
import numpy as np
import evaluate

# Load the accuracy metric (load_metric from datasets is deprecated; use the evaluate library)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
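You can sanity-check this function outside the Trainer with toy values (an illustration only, not part of the training loop):
# Two toy predictions: argmax picks class 1 then class 0, matching the toy labels
dummy_logits = np.array([[0.1, 0.9], [2.0, -1.0]])
dummy_labels = np.array([1, 0])
print(compute_metrics((dummy_logits, dummy_labels)))  # {'accuracy': 1.0}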
The Trainer class simplifies the training loop. It orchestrates data loading, model forward/backward passes, optimization, evaluation, and checkpointing based on the provided model, arguments, datasets, tokenizer, and metrics function.
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,   # Use the full training set for an actual run
    eval_dataset=small_eval_dataset,     # Use the full test set for proper evaluation
    tokenizer=tokenizer,                 # Used for padding collation and saved with the model
    compute_metrics=compute_metrics,
)
With everything set up, initiating the full fine-tuning process is a single command:
# Start the training process
train_result = trainer.train()
# Optionally, save training metrics
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
# Save the final fine-tuned model and tokenizer
trainer.save_model(output_dir) # Saves the best model due to load_best_model_at_end=True
trainer.save_state() # Saves trainer state including RNG states
During training, you will see output logs showing the training loss decreasing and evaluation metrics (accuracy) potentially improving after each epoch. This directly corresponds to the model parameters θ being updated via gradient descent based on the loss computed on the imdb dataset.
After training completes, you can explicitly run evaluation on the test set (or the specified eval_dataset).
# Evaluate the final model
eval_results = trainer.evaluate()
# Print evaluation results
print(f"Evaluation results: {eval_results}")
trainer.log_metrics("eval", eval_results)
trainer.save_metrics("eval", eval_results)
The output will show the performance of your fine-tuned model on the held-out evaluation data.
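To try the fine-tuned model on new text, you can load it back from the output directory with a text-classification pipeline; a minimal sketch, assuming the model and tokenizer were saved to output_dir as above (the exact score will vary):
from transformers import pipeline

# Load the saved model and tokenizer for inference
classifier = pipeline("text-classification", model=output_dir, tokenizer=output_dir)
print(classifier("A genuinely moving film with superb performances."))
# e.g. [{'label': 'LABEL_1', 'score': 0.98}]  # LABEL_1 corresponds to the positive class here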
Even with DistilBERT and a subset of the data, you likely observed that training required a noticeable amount of time and GPU memory. Scaling this to larger models like GPT-3 variants or Llama models requires significantly more resources, often involving multiple high-end GPUs and distributed training strategies, as discussed in Chapter 7. Activation memory grows roughly quadratically with sequence length under standard self-attention, while the memory for gradients and optimizer state grows linearly with the number of parameters being updated.
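To make the parameter-count part of this concrete, here is a back-of-the-envelope estimate of the per-parameter state that full fine-tuning with AdamW in fp32 must hold (weights, gradients, and two Adam moment buffers), ignoring activations; a rough sketch, not a precise measurement:
# Rough estimate: 4 bytes each for fp32 weights, gradients, and the two Adam moment buffers
params = 67e6                      # approximate parameter count for DistilBERT + head
bytes_per_param = 4 + 4 + 4 + 4    # weights + gradients + Adam m and v
print(f"~{params * bytes_per_param / 1e9:.1f} GB before activations")  # about 1.1 GB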
This hands-on exercise demonstrates the complete workflow for full parameter fine-tuning using standard tools. You loaded data, prepared it, configured training, executed the fine-tuning loop, and evaluated the result. This process, while effective, highlights the computational cost involved in updating every model parameter. This cost motivates the exploration of Parameter-Efficient Fine-tuning (PEFT) methods, which we will cover in the next chapter, aiming to achieve comparable adaptation results with drastically reduced computational requirements.