Let's put the concepts from this chapter into practice. We'll focus on fine-tuning a pre-trained CNN model on a specialized dataset, applying some of the advanced strategies discussed, such as discriminative learning rates and gradual unfreezing. This scenario is common when adapting powerful models, originally trained on large datasets like ImageNet, to more niche tasks with potentially limited data.
Imagine we have a specialized dataset, let's call it "FineGrainedParts", containing images of various industrial components like screws, bolts, and washers, categorized into 50 specific subtypes. This dataset is significantly smaller than ImageNet and exhibits different visual characteristics (e.g., metallic textures, uniform backgrounds, subtle inter-class variations). Our goal is to build an accurate classifier for these parts.
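Before touching the model, we need a way to load this data. Below is a minimal sketch assuming the images are organized in the common one-folder-per-class layout; the directory names (finegrained_parts/train, finegrained_parts/val) and the transform choices are placeholders for illustration.
import torch
from torchvision import datasets, transforms

# Hypothetical layout: finegrained_parts/train/<subtype_name>/*.jpg,
# with a matching finegrained_parts/val/ split for validation
basic_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_dataset = datasets.ImageFolder("finegrained_parts/train", transform=basic_transform)
val_dataset = datasets.ImageFolder("finegrained_parts/val", transform=basic_transform)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)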
We'll start with a standard architecture, like ResNet50, pre-trained on ImageNet. Most deep learning frameworks provide easy access to such models. We assume you have a working environment with PyTorch or TensorFlow. Here, we'll use PyTorch-like syntax for illustration.
First, load the pre-trained model. We need to replace the final classification layer, which was originally trained for 1000 ImageNet classes, with a new layer suitable for our 50 "FineGrainedParts" classes.
import torch
import torchvision.models as models
import torch.nn as nn
# Load a pre-trained ResNet50 model
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Get the number of input features for the classifier
num_ftrs = model.fc.in_features
# Replace the final fully connected layer
# Our dataset has 50 classes
num_classes = 50
model.fc = nn.Linear(num_ftrs, num_classes)
# Define the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print("Model loaded and final layer replaced.")
# Output should confirm model structure change
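If you want to confirm the replacement explicitly, printing the new head shows its shape; ResNet50's penultimate layer produces 2048 features, so the new layer maps 2048 inputs to our 50 classes.
# Inspect the new classifier head
print(model.fc)
# Linear(in_features=2048, out_features=50, bias=True)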
A simple approach is to fine-tune all layers simultaneously with a single, small learning rate. However, as discussed earlier, this might not be optimal. Early layers in a pre-trained model often learn general features (edges, textures) that are broadly useful, while later layers learn more task-specific features. Updating all layers aggressively from the start, especially on a small or dissimilar dataset, can disrupt the valuable learned representations in the early layers. This is where advanced techniques come into play.
The first of these, discriminative learning rates, involves applying different learning rates to different parts of the network. We typically use smaller learning rates for earlier layers (preserving general features) and larger ones for later layers (adapting task-specific features and the new classifier head).
Let's define parameter groups for our ResNet50 model. We can group the initial convolutional block, the sequential residual blocks (layer1, layer2, layer3, layer4), and the final classifier layer.
import torch.optim as optim
# Define base learning rate and multiplier
base_lr = 1e-4
lr_multiplier = 10
# Group parameters with different learning rates
# Smaller LR for early layers, larger for later layers/classifier
optimizer = optim.AdamW([
    # ReLU, max pooling, and average pooling have no learnable parameters,
    # so they need no parameter groups of their own
    {'params': model.conv1.parameters(), 'lr': base_lr / (lr_multiplier**4)},
    {'params': model.bn1.parameters(), 'lr': base_lr / (lr_multiplier**4)},
    {'params': model.layer1.parameters(), 'lr': base_lr / (lr_multiplier**3)},
    {'params': model.layer2.parameters(), 'lr': base_lr / (lr_multiplier**2)},
    {'params': model.layer3.parameters(), 'lr': base_lr / lr_multiplier},
    {'params': model.layer4.parameters(), 'lr': base_lr},
    {'params': model.fc.parameters(), 'lr': base_lr * lr_multiplier}
], lr=base_lr)  # Default LR (unused here because every learnable parameter is grouped)
print("Optimizer configured with discriminative learning rates.")
# Verify optimizer parameter groups (optional)
# for group in optimizer.param_groups:
#     print(f"LR: {group['lr']}, Num Params: {sum(p.numel() for p in group['params'])}")
This setup assigns exponentially decreasing learning rates to earlier layers, allowing the newly added classifier and later layers to adapt more quickly while protecting the foundational features learned during pre-training.
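If listing every group by hand feels verbose, the same schedule can be generated in a loop. The sketch below is one way to do it, assuming you are happy splitting ResNet50 into the stem, the four residual stages, and the classifier head; it reproduces the learning rates configured above.
# Build the same parameter groups programmatically with layer-wise decay
blocks = [
    nn.Sequential(model.conv1, model.bn1),  # stem
    model.layer1,
    model.layer2,
    model.layer3,
    model.layer4,
    model.fc,
]
param_groups = [
    {'params': block.parameters(), 'lr': base_lr * lr_multiplier ** (i - 4)}
    for i, block in enumerate(blocks)
]
optimizer_alt = optim.AdamW(param_groups, lr=base_lr)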
Another effective strategy, particularly useful for smaller target datasets, is gradual unfreezing. Initially, we freeze all pre-trained layers and only train the newly added classifier head. Once the classifier starts learning, we unfreeze progressively deeper layers and continue training, often lowering the overall learning rate as more layers become trainable.
Phase 1: Train Only the Classifier Head
# Freeze all layers except the final classifier
for param in model.parameters():
    param.requires_grad = False
# Re-enable gradients for the new classifier head's parameters
for param in model.fc.parameters():
    param.requires_grad = True
# Optimizer for only the classifier parameters
optimizer_phase1 = optim.AdamW(model.fc.parameters(), lr=base_lr * lr_multiplier)
print("Phase 1: Training only the classifier head.")
# Assume a function train_model(model, optimizer, num_epochs) exists
# train_model(model, optimizer_phase1, num_epochs=5)
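The train_model helper referenced above is not defined in this chapter; a minimal sketch of what it might look like is shown below, assuming the train_loader and val_loader built earlier (and a standard cross-entropy objective) are in scope.
import torch.nn.functional as F

def train_model(model, optimizer, num_epochs):
    # Minimal sketch of a training loop with a validation pass each epoch
    for epoch in range(num_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        print(f"Epoch {epoch + 1}: validation accuracy = {correct / total:.3f}")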
Phase 2: Unfreeze Top Layers and Train
After the initial phase, unfreeze some of the later layers (e.g., layer4 and layer3) and continue training with a lower learning rate, possibly using discriminative rates.
# Unfreeze layer4 and layer3
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.layer3.parameters():
    param.requires_grad = True
# Reconfigure optimizer with parameters from fc, layer4, layer3
# Example using a single lower LR for simplicity here
# A discriminative approach as shown before could also be applied
trainable_params = list(model.fc.parameters()) + \
                   list(model.layer4.parameters()) + \
                   list(model.layer3.parameters())
optimizer_phase2 = optim.AdamW(trainable_params, lr=base_lr / 10)
print("Phase 2: Training classifier head, layer4, and layer3.")
# train_model(model, optimizer_phase2, num_epochs=10) # Continue training
Subsequent Phases:
You can continue this process, unfreezing more layers (e.g., layer2, then layer1) and further reducing the learning rate until the entire network is trainable, or until performance plateaus.
Discriminative learning rates and gradual unfreezing can be combined. For instance, after unfreezing a block of layers, you can assign them a specific learning rate relative to the classifier head and other blocks.
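As an illustrative sketch of this combination, a third phase might unfreeze layer2 and give each trainable block its own rate. The specific ratios below are assumptions chosen to show the pattern, not tuned values.
# Phase 3 sketch: unfreeze layer2 and combine both techniques
for param in model.layer2.parameters():
    param.requires_grad = True

optimizer_phase3 = optim.AdamW([
    {'params': model.layer2.parameters(), 'lr': base_lr / 100},
    {'params': model.layer3.parameters(), 'lr': base_lr / 10},
    {'params': model.layer4.parameters(), 'lr': base_lr},
    {'params': model.fc.parameters(), 'lr': base_lr * lr_multiplier}
], lr=base_lr)
# train_model(model, optimizer_phase3, num_epochs=10)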
Throughout this process, careful monitoring is essential. Track training and validation loss, as well as accuracy (or other relevant metrics for your specialized task). Pay close attention to validation performance to detect overfitting, which is a common risk when fine-tuning on smaller datasets. Techniques discussed in Chapter 2, such as data augmentation, dropout, and weight decay, are particularly important here.
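As one illustration, augmentation can be added at the data-loading stage and weight decay passed directly to AdamW. The specific transforms and coefficients below are assumptions that would need tuning for metallic parts on uniform backgrounds.
from torchvision import transforms

# Augmented training transform (using ImageNet normalization statistics)
augmented_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Weight decay is a built-in AdamW argument
optimizer_reg = optim.AdamW(model.fc.parameters(),
                            lr=base_lr * lr_multiplier,
                            weight_decay=1e-2)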
Let's visualize how validation accuracy might progress using different fine-tuning strategies on our hypothetical "FineGrainedParts" dataset.
Comparison of validation accuracy over epochs for naive fine-tuning versus a strategy combining gradual unfreezing and discriminative learning rates on the hypothetical FineGrainedParts dataset. The advanced strategy often leads to faster convergence and higher final accuracy.
This practical exercise demonstrates how advanced fine-tuning techniques allow you to adapt powerful pre-trained models effectively, even when facing the challenges of specialized datasets with limited examples or different data distributions compared to the original pre-training data. Remember that the optimal strategy often requires experimentation and careful monitoring tailored to your specific model, dataset, and task.