As discussed earlier in this chapter, while CNNs effectively capture local features through their convolutional filters, modeling long-range dependencies across an entire image remains a challenge due to the inherently local nature of convolution operations. Substantially enlarging the receptive field requires very deep networks or large kernels, which can be computationally expensive and hard to optimize. Techniques like attention mechanisms integrated into CNNs help, but they still operate within a fundamentally convolutional framework.
The Transformer architecture, initially developed for natural language processing (NLP), offered a different approach. Transformers rely entirely on self-attention mechanisms, allowing them to model dependencies between any two elements in a sequence, regardless of their distance. This proved highly effective for tasks like machine translation and text generation where understanding the context across long sentences is necessary.
A significant question arose: could this powerful sequence-modeling architecture be adapted for computer vision? Images, unlike text, are not inherently sequential one-dimensional structures. They possess a strong 2D spatial structure, and the number of pixels (potential sequence elements) in a typical image is vastly larger than the number of words in a typical sentence, posing computational challenges for the standard Transformer, whose self-attention has quadratic complexity with respect to sequence length.
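To make the scale concrete, here is a rough count of pairwise self-attention interactions, which grow quadratically with sequence length. The 224x224 image size and 50-word sentence are illustrative assumptions, not values taken from any specific paper.

```python
# Self-attention compares every token with every other token,
# so the number of pairwise interactions grows with the square of the sequence length.
pixel_tokens = 224 * 224      # one token per pixel of a 224x224 image -> 50,176 tokens
word_tokens = 50              # a typical sentence length, for comparison

print(pixel_tokens ** 2)      # 2,517,630,976 pairwise interactions per attention layer
print(word_tokens ** 2)       # 2,500 pairwise interactions
```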
The Vision Transformer (ViT) model represents a successful and influential approach to applying Transformers directly to image classification. Proposed by Dosovitskiy et al. in 2020, the core idea is surprisingly straightforward: treat an image as a sequence of smaller, fixed-size patches.
An input image is divided into a grid of non-overlapping patches. Each patch is then flattened into a vector, forming a sequence that can be processed by a standard Transformer encoder.
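As an illustration of this patching step, the short sketch below splits a dummy image tensor into non-overlapping patches and flattens each one into a vector. The 224x224 image size and 16x16 patch size are assumed for the example; PyTorch's tensor `unfold` is used here purely for demonstration.

```python
import torch

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width); random stand-in for a real image
patch_size = 16

# Carve the image into non-overlapping 16x16 patches along height and width.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# (1, 3, 14, 14, 16, 16) -> group the 14x14 grid of patches and flatten each patch to a vector.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)   # torch.Size([1, 196, 768]): a sequence of 196 patch vectors of length 768
```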
Here's how it works at a high level:

1. Patch extraction: the image is split into a grid of fixed-size, non-overlapping patches, and each patch is flattened into a vector, as described above.
2. Patch embedding: each flattened patch is linearly projected into an embedding vector, and position embeddings are added so the model retains information about where each patch came from.
3. Class token: a learnable [class] token embedding is often prepended to the sequence, similar to BERT's [CLS] token, whose corresponding output from the Transformer is used for classification.
4. Transformer encoder: the resulting sequence of embeddings is processed by a standard Transformer encoder built from self-attention and feed-forward layers.
5. Classification head: the output corresponding to the [class] token is passed through a small classification head (typically an MLP) to produce the final prediction.

This approach effectively bypasses the need for convolutions, directly applying the sequence-processing power of Transformers to visual data. However, a significant difference compared to CNNs is the lack of strong inductive biases. CNNs have built-in assumptions about locality (pixels nearby are related) and translation equivariance (a feature detected in one location can be detected elsewhere). ViT has much weaker biases, learning relationships almost entirely from data through self-attention. This makes ViT potentially more general but often requires significantly larger datasets for pre-training to achieve performance comparable to or exceeding state-of-the-art CNNs.
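The following sketch ties these steps together in a toy, ViT-style classifier. It is a minimal illustration rather than the reference ViT implementation: the embedding size, depth, number of heads, and number of classes are arbitrary assumptions, PyTorch's stock TransformerEncoder stands in for the exact encoder block used in the paper, and the head is a single linear layer for brevity.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT-style classifier sketch: patch embedding + [class] token +
    position embeddings + Transformer encoder + linear classification head."""
    def __init__(self, num_patches=196, patch_dim=768, embed_dim=192,
                 depth=4, num_heads=4, num_classes=10):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)                  # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))         # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learned position embeddings
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)       # standard Transformer encoder
        self.head = nn.Linear(embed_dim, num_classes)                       # classification head

    def forward(self, patches):                          # patches: (batch, num_patches, patch_dim)
        x = self.patch_embed(patches)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend [class] token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                        # predict from the [class] token's output

model = MiniViT()
logits = model(torch.randn(2, 196, 768))   # e.g. the flattened 16x16 patches from the earlier snippet
print(logits.shape)                        # torch.Size([2, 10])
```

Note that the published model differs in several details, such as pre-normalization inside the encoder blocks and an MLP head during pre-training; the sketch only mirrors the overall flow of the five steps listed above.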
The introduction of ViT marked a notable shift in computer vision research, demonstrating that architectures heavily reliant on self-attention could achieve excellent results on image recognition tasks, previously dominated by CNNs. The following sections will provide a more detailed look into the specific architectural components of ViT and compare its characteristics with CNNs.