While Convolutional Neural Networks (CNNs) excel at learning local patterns through spatially restricted receptive fields, capturing global context and long-range dependencies within an image requires different approaches. This chapter introduces attention mechanisms and Transformer architectures as methods that improve a vision model's ability to capture these broader relationships.
You will learn how self-attention mechanisms can be integrated into CNN architectures, allowing the network to selectively emphasize informative features. We will cover specific examples such as Squeeze-and-Excitation (SE) blocks and non-local networks. Following this, we will examine the Vision Transformer (ViT), a model that applies the Transformer architecture directly to image data by treating an image as a sequence of patches. We will study the ViT's core components, including patch embedding and the multi-head self-attention layers. Finally, we will discuss hybrid models that combine convolutional and Transformer elements, and compare the operational characteristics and data requirements of CNNs and ViTs.
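As a concrete preview of the channel-attention modules covered in section 5.1, here is a minimal sketch of a Squeeze-and-Excitation block in PyTorch. The channel count and reduction ratio below are illustrative assumptions, not values prescribed by the chapter.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block (illustrative sketch).

    Squeezes spatial information into a per-channel descriptor via global
    average pooling, then learns per-channel scaling weights (excitation).
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore dimension
            nn.Sigmoid(),                                # gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Squeeze: global average pool to shape (batch, channels)
        s = x.mean(dim=(2, 3))
        # Excitation: learned per-channel attention weights
        w = self.fc(s).view(b, c, 1, 1)
        # Recalibrate: scale each feature map by its weight
        return x * w

# Usage: recalibrate a batch of 64-channel feature maps (assumed sizes)
features = torch.randn(8, 64, 32, 32)
out = SEBlock(channels=64)(features)
print(out.shape)  # torch.Size([8, 64, 32, 32])
```

Because the block only rescales channels, its output shape matches its input, so it can be dropped into an existing CNN after almost any convolutional stage.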
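The ViT's patch embedding step (sections 5.3 and 5.4) can be sketched just as briefly. The version below assumes the common ViT-Base configuration (224x224 images, 16x16 patches, 768-dimensional embeddings) purely for illustration; the key idea is that a stride-p convolution with kernel size p is equivalent to slicing the image into non-overlapping p x p patches and applying a shared linear projection to each.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches and projects each to a vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Stride = kernel size, so each patch is seen exactly once
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, embed_dim, H/p, W/p)
        x = self.proj(x)
        # Flatten the patch grid into a token sequence: (B, num_patches, embed_dim)
        return x.flatten(2).transpose(1, 2)

# Usage: a 224x224 image becomes 196 patch tokens of dimension 768
img = torch.randn(1, 3, 224, 224)
tokens = PatchEmbedding()(img)
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting token sequence is what the Transformer encoder's multi-head self-attention layers operate on, exactly as a sentence of word embeddings would be in NLP.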
5.1 Self-Attention Mechanisms in CNNs
5.2 Non-local Neural Networks
5.3 Introduction to Vision Transformers
5.4 ViT Architecture: Patches, Embeddings, Transformer Encoder
5.5 Hybrid CNN-Transformer Models
5.6 Comparing CNNs and Transformers for Vision Tasks
5.7 Practice: Implementing Attention Blocks in CNNs