Applying full parameter fine-tuning in practice requires understanding its mechanism, hyperparameter choices, regularization needs, and resource demands. This hands-on exercise guides you through fully fine-tuning a relatively small pre-trained language model on a specific downstream task. Even with a model that is "small" by LLM standards, the process demands meaningful computational resources, which underscores the importance of resource management.

We will use the Hugging Face transformers and datasets libraries, which provide convenient abstractions for many common NLP tasks and models. This example focuses on text classification, a common application for fine-tuning.

### Environment Setup

First, ensure you have the necessary libraries installed. You'll primarily need transformers, datasets, evaluate, torch (or tensorflow), and accelerate for efficient training; scikit-learn is required by the metrics used below.

```bash
pip install transformers datasets evaluate torch accelerate scikit-learn
```

We assume you are working in an environment with access to a GPU, as full fine-tuning, even for smaller models, can be very slow on a CPU. The accelerate library manages device placement (CPU/GPU) automatically.

### Choosing the Model and Dataset

For this practical exercise, we'll use distilbert-base-uncased, a distilled version of BERT that retains most of BERT's performance while being smaller and faster. For the task, we'll use the imdb dataset, a standard benchmark for binary sentiment classification (positive/negative movie reviews).

```python
# Define model and dataset names
model_checkpoint = "distilbert-base-uncased"
dataset_name = "imdb"
```

### Loading and Preparing the Data

The datasets library makes loading standard datasets straightforward.

```python
from datasets import load_dataset

# Load the dataset
raw_datasets = load_dataset(dataset_name)

# Display dataset structure (optional)
print(raw_datasets)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['text', 'label'],
#         num_rows: 50000
#     })
# })
```

Next, we need to tokenize the text so the model can process it. We use the tokenizer corresponding to our chosen pre-trained model.

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Define the tokenization function
def tokenize_function(examples):
    # Truncate sequences longer than the model's maximum input size
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Apply tokenization to the entire dataset (batched for efficiency)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Remove the original 'text' column as it's no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

# Rename 'label' to 'labels', which is expected by the Trainer
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Set the format to PyTorch tensors
tokenized_datasets.set_format("torch")

# Create smaller subsets for quicker demonstration (optional)
# Remove or adjust these lines for a full run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

print("\nSample tokenized training data:")
print(small_train_dataset[0])
```

This step converts the raw text into input IDs and attention masks, and keeps the labels required for supervised learning.
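The padding="max_length" approach above pads every review to 512 tokens, which is simple but wastes compute on short examples. A common alternative is dynamic padding: tokenize with truncation only and let a data collator pad each batch to the length of its longest example. The following is a minimal optional sketch, reusing the raw_datasets and model_checkpoint defined above; it is not required for the rest of the walkthrough.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

# Optional alternative: dynamic per-batch padding instead of padding everything to 512 tokens
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_without_padding(examples):
    # Truncate only; padding is deferred to the collator at batch-construction time
    return tokenizer(examples["text"], truncation=True, max_length=512)

dynamically_padded_datasets = raw_datasets.map(tokenize_without_padding, batched=True)

# The collator pads each batch to its own longest sequence and returns PyTorch tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# If you use this route, pass data_collator=data_collator to the Trainer later on.
```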
### Loading the Model

We load the pre-trained DistilBERT model configured for sequence classification. AutoModelForSequenceClassification adds a randomly initialized classification head on top of the base DistilBERT model; we tell it how many labels to expect via num_labels (here 2: positive/negative).

```python
from transformers import AutoModelForSequenceClassification

# Load the model for sequence classification
# num_labels=2 for binary classification (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
```

At this point, model contains the pre-trained weights. During full fine-tuning, all of these weights, both the base model and the newly added classification head, will be updated.

### Configuring Training Arguments

The TrainingArguments class holds all the hyperparameters and settings for the training process, including parameters discussed earlier such as learning rate, batch size, number of epochs, and regularization (weight decay).

```python
from transformers import TrainingArguments

# Define output directory for checkpoints and logs
output_dir = "./results/distilbert-imdb-full"

training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",       # Evaluate performance at the end of each epoch
    save_strategy="epoch",             # Save a checkpoint at the end of each epoch
    num_train_epochs=3,                # Number of training epochs (adjust as needed)
    per_device_train_batch_size=8,     # Batch size per GPU
    per_device_eval_batch_size=8,      # Batch size for evaluation
    learning_rate=2e-5,                # Starting learning rate (a common value for fine-tuning)
    weight_decay=0.01,                 # Apply weight decay for regularization
    logging_dir="./logs",              # Directory for storing logs
    logging_steps=100,                 # Log training loss every 100 steps
    load_best_model_at_end=True,       # Load the best-performing checkpoint at the end
    metric_for_best_model="accuracy",  # Metric that determines the 'best' model
    # Set push_to_hub=True to upload results to the Hugging Face Hub (requires login)
    # push_to_hub=False,
)
```

These arguments directly control the fine-tuning process discussed in previous sections. Choosing appropriate values (such as the learning rate and number of epochs) often requires experimentation.

### Defining Evaluation Metrics

To monitor performance during training, we need a function that computes metrics from model predictions and true labels. We'll use standard accuracy for this classification task.

```python
import numpy as np
import evaluate  # load_metric was removed from recent datasets releases; evaluate provides the same metrics

# Load the accuracy metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
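Accuracy alone can hide class-specific behavior. Since scikit-learn is already installed, the following is a minimal sketch of an extended metrics function that also reports precision, recall, and F1; the name compute_metrics_detailed is our own addition, not part of the original walkthrough.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_detailed(eval_pred):
    """Variant of compute_metrics that also reports precision, recall, and F1."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

If you use this variant, pass compute_metrics=compute_metrics_detailed to the Trainer below; metric_for_best_model="accuracy" still works because the returned dictionary keeps an "accuracy" key.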
### Initializing the Trainer

The Trainer class simplifies the training loop. It orchestrates data loading, forward/backward passes, optimization, evaluation, and checkpointing based on the provided model, arguments, datasets, tokenizer, and metrics function.

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,  # Use the full dataset for actual training
    eval_dataset=small_eval_dataset,    # Use the full test set for proper evaluation
    tokenizer=tokenizer,                # Used for padding/collation
    compute_metrics=compute_metrics,
)
```

### Starting the Fine-tuning Process

With everything set up, initiating full fine-tuning is a single command:

```python
# Start the training process
train_result = trainer.train()

# Optionally, save training metrics
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)

# Save the final fine-tuned model and tokenizer
trainer.save_model(output_dir)  # Saves the best model because load_best_model_at_end=True
trainer.save_state()            # Saves the trainer state, including RNG states
```

During training, you will see logs showing the training loss decreasing and the evaluation metric (accuracy) typically improving after each epoch. This corresponds directly to the model parameters $\theta$ being updated via gradient descent on the loss computed over the imdb dataset.

### Evaluating the Fine-tuned Model

After training completes, you can explicitly run evaluation on the test set (or the specified eval_dataset).

```python
# Evaluate the final model
eval_results = trainer.evaluate()

# Print evaluation results
print(f"Evaluation results: {eval_results}")
trainer.log_metrics("eval", eval_results)
trainer.save_metrics("eval", eval_results)
```

The output shows the performance of your fine-tuned model on the held-out evaluation data.

### Resource Demands and Next Steps

Even with DistilBERT and a subset of the data, you likely observed that training required a noticeable amount of time and GPU memory. Scaling this to larger models such as GPT-3 variants or Llama models requires significantly more resources, often involving multiple high-end GPUs and distributed training strategies, as discussed in Chapter 7. Activation memory grows roughly quadratically with sequence length for standard attention, while the memory for weights, gradients, and optimizer state grows linearly with the number of parameters being updated; a rough back-of-envelope estimate of that overhead is sketched at the end of this section.

This hands-on exercise demonstrated the complete workflow for full parameter fine-tuning with standard tools: you loaded and prepared data, configured training, executed the fine-tuning loop, and evaluated the result. The process is effective, but it highlights the computational cost of updating every model parameter. This cost motivates Parameter-Efficient Fine-tuning (PEFT) methods, covered in the next chapter, which aim to achieve comparable adaptation with drastically reduced computational requirements.
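The sketch referenced above makes the cost concrete. The 16 bytes per parameter figure is a common rule of thumb for plain fp32 training with AdamW (4 bytes for weights, 4 for gradients, 8 for the two optimizer moments), not an exact measurement; activation memory, which scales with batch size and sequence length, comes on top, and mixed-precision setups change the numbers.

```python
def full_finetuning_state_gb(num_params: int, bytes_per_param: int = 16) -> float:
    """Rough rule of thumb for fp32 training with AdamW:
    4 B weights + 4 B gradients + 8 B optimizer moments ~= 16 B per trainable parameter.
    Activations are extra and depend on batch size and sequence length."""
    return num_params * bytes_per_param / 1e9

# DistilBERT-base has roughly 66M parameters; a 7B-parameter model is over 100x larger.
for name, n_params in [("distilbert-base (~66M params)", 66_000_000),
                       ("7B-parameter model", 7_000_000_000)]:
    print(f"{name}: ~{full_finetuning_state_gb(n_params):.1f} GB "
          f"for weights, gradients, and optimizer state")
```

Because PEFT methods update only a small fraction of the parameters, the gradient and optimizer-state terms in this estimate shrink dramatically, which is exactly the motivation the next chapter picks up.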