As introduced earlier, leveraging pre-trained models through transfer learning is a standard practice in computer vision, significantly reducing the need for vast datasets and extensive training time. You're likely familiar with the fundamental techniques, but a brief review provides a solid starting point before we examine the more sophisticated adaptation methods required for challenging real-world scenarios.
The core idea remains straightforward: a model, typically a Convolutional Neural Network (CNN), is first trained on a large, general-purpose dataset like ImageNet. This pre-training phase allows the model to learn a rich hierarchy of visual features, from simple edges and textures in the early layers to more complex object parts and shapes in deeper layers. These learned features often generalize well to other visual tasks. Instead of starting the learning process from random weights for a new task, we initialize our model using these pre-trained weights, transferring the learned knowledge.
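To make this concrete, here is a minimal PyTorch/torchvision sketch of the initialization step: loading an ImageNet pre-trained backbone and swapping its classification head for the new task. It assumes torchvision 0.13 or later (for the weights enum API), and NUM_CLASSES is a placeholder for your task's label count.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # placeholder: number of classes in the new task

# Initialize from ImageNet pre-trained weights instead of random weights.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# The convolutional base keeps its learned features; only the final
# fully connected layer is replaced to match the new label space.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```

The same pattern applies to other torchvision backbones; only the attribute holding the classification head changes (for example, `classifier` instead of `fc`).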
Two primary strategies dominate the application of transfer learning: feature extraction and fine-tuning.
In the feature extraction approach, the pre-trained model, excluding its final classification layer (the "head"), is used as a fixed feature extractor. The learned weights of the convolutional base are frozen, meaning they are not updated during training on the new dataset.
This method is computationally efficient during training as gradients only need to be computed for the small, new head. It's particularly effective when the target dataset is small and similar to the dataset the original model was trained on (e.g., classifying different types of flowers using an ImageNet pre-trained model). The assumption is that the general features learned during pre-training are sufficiently representative for the new task.
A conceptual view of the feature extraction strategy. The pre-trained convolutional base layers are frozen, and only the newly added task-specific head is trained.
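A minimal sketch of feature extraction in PyTorch, continuing the ResNet-50 example above: freeze every parameter in the pre-trained base, attach a fresh head, and give the optimizer only the head's parameters. NUM_CLASSES and the learning rate are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # placeholder for the new task's label count

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the entire pre-trained base so its weights stay fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace the head; its parameters are created fresh and remain trainable.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the head's parameters are optimized, so gradient computation and
# updates are confined to the small new layer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```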
Fine-tuning takes the transfer learning process a step further. It starts similarly to feature extraction by initializing the model with pre-trained weights and adding a new head. However, instead of keeping the entire convolutional base frozen, some of the top layers of the pre-trained base are unfrozen and trained along with the new head.
Fine-tuning allows the model to adapt the pre-trained features more specifically to the nuances of the new dataset and task. It is generally preferred when the target dataset is reasonably large and potentially somewhat different from the original pre-training dataset. Using a low learning rate is important: it prevents the large gradients produced by the randomly initialized head from quickly destroying the valuable pre-trained weights in the base layers. Adjusting these higher-level features lets the model specialize to the new task.
A conceptual view of the fine-tuning strategy. Lower layers of the pre-trained base remain frozen, while the top layers and the new head are trained (fine-tuned) together, usually with a low learning rate.
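A minimal fine-tuning sketch under the same assumptions: the lower stages of the ResNet stay frozen, the last residual stage ("layer4") and the new head are trained, and the optimizer's parameter groups assign a lower learning rate to the pre-trained layers than to the fresh head. The specific learning rates are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # placeholder

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Freeze everything first, then unfreeze the last residual stage and the
# new head so both adapt to the target data.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

# A low learning rate for the unfrozen base layers protects the pre-trained
# weights; the randomly initialized head can tolerate a larger step size.
optimizer = torch.optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-4},
])
```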
It's useful to think of pure feature extraction and full fine-tuning (where all layers are unfrozen) as ends of a spectrum. The common practice often lies somewhere in between, involving selective unfreezing of layer blocks based on dataset size, task similarity, and computational budget.
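One way to move along this spectrum is a small helper that unfreezes the last N stages of the backbone. The function below is a hypothetical utility for torchvision ResNets, not a library API: N = 0 gives pure feature extraction, N = 4 approaches full fine-tuning.

```python
from torchvision import models

def unfreeze_last_blocks(model, num_blocks):
    """Unfreeze the last `num_blocks` residual stages of a torchvision ResNet.

    Illustrative helper: num_blocks=0 is pure feature extraction,
    num_blocks=4 approaches full fine-tuning.
    """
    stages = [model.layer1, model.layer2, model.layer3, model.layer4]
    for param in model.parameters():
        param.requires_grad = False
    for stage in stages[len(stages) - num_blocks:]:
        for param in stage.parameters():
            param.requires_grad = True
    # The classification head is always left trainable.
    for param in model.fc.parameters():
        param.requires_grad = True
    return model

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model = unfreeze_last_blocks(model, num_blocks=2)  # a midpoint on the spectrum
```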
This review sets the context for the rest of this chapter. While these standard strategies are powerful, they often assume that the source (pre-training) and target (new task) data distributions are relatively similar. The advanced techniques we will cover, such as domain adaptation, domain generalization, few-shot learning, and self-supervised pre-training, address scenarios where this assumption breaks down or where labeled data is scarce. Understanding the mechanics and trade-offs of feature extraction and fine-tuning is fundamental to appreciating why and how these more advanced adaptation strategies work.