Building on the idea that Transformers can model long-range dependencies, the Vision Transformer (ViT) applies this architecture directly to image classification, largely discarding the specialized inductive biases inherent in CNNs, such as locality and translation equivariance. Proposed by Dosovitskiy et al. in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020), ViT processes images by treating them as sequences of smaller patches, analogous to how Transformers process sequences of words in natural language processing. Let's break down its core architectural components.
The first step in ViT is to decompose the input image into a sequence of fixed-size patches. Instead of processing pixels individually or through sliding convolutional windows, ViT reshapes the image into a series of flattened patches.
Consider an input image $x$ with dimensions $H \times W \times C$ (height, width, channels). ViT divides this image into $N$ non-overlapping patches, each of size $P \times P$. The total number of patches is $N = (H \times W)/P^2$. Each patch is then flattened into a vector of size $P^2 \cdot C$.
For example, a $224 \times 224$ RGB image ($C = 3$) processed with a patch size of $P = 16$ results in $N = (224 \times 224)/16^2 = 14 \times 14 = 196$ patches. Each patch, when flattened, is a vector of dimension $16 \times 16 \times 3 = 768$. The image is thus transformed from a spatial grid into a sequence of 196 vectors, each of length 768. This sequence formation is fundamental for compatibility with the Transformer architecture. The choice of patch size $P$ presents a trade-off: smaller patches lead to longer sequences ($N$ increases) and finer spatial granularity but increase computational cost, while larger patches result in shorter sequences but potentially lose fine details.
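The reshaping step can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the reference implementation; the `(B, C, H, W)` tensor layout and the use of `unfold` are assumptions made here for clarity.

```python
import torch

B, C, H, W = 1, 3, 224, 224
P = 16                                   # patch size
x = torch.randn(B, C, H, W)              # dummy RGB image batch

# Extract non-overlapping P x P blocks along the height and width dimensions
patches = x.unfold(2, P, P).unfold(3, P, P)    # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)    # (B, 14, 14, C, P, P)
patches = patches.flatten(1, 2).flatten(2)     # (B, N, C*P*P) = (1, 196, 768)

print(patches.shape)  # torch.Size([1, 196, 768])
```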
Transformers operate on sequences of embedding vectors of a fixed dimension, typically denoted $D$. The flattened patches, each of dimension $P^2 \cdot C$, must therefore be projected into this $D$-dimensional embedding space. This is achieved using a trainable linear projection, commonly implemented as a single linear layer applied to the flattened patches (or, equivalently, a 2D convolution with kernel size and stride equal to $P$).
Let $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ be the learnable embedding matrix (the weights of the linear layer). Each flattened patch vector $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ (where $i$ ranges from 1 to $N$) is multiplied by $E$ to produce a patch embedding $z_p^i \in \mathbb{R}^D$:
$$z_p^i = x_p^i E$$
This projection allows the model to learn representations for the patches in a suitable dimension for the subsequent Transformer layers.
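As a sketch of this projection, using the ViT-Base values $D = 768$, $P = 16$, $C = 3$ (assumed here for concreteness), both the linear-layer form and the equivalent strided convolution look like this in PyTorch:

```python
import torch
from torch import nn

P, C, D = 16, 3, 768

# Option 1: apply a linear layer E of shape (P*P*C, D) to the flattened patches
linear_proj = nn.Linear(P * P * C, D)
flat_patches = torch.randn(1, 196, P * P * C)       # (B, N, P^2*C) from the previous step
patch_embeddings = linear_proj(flat_patches)         # (B, N, D)

# Option 2 (equivalent): a 2D convolution with kernel size and stride equal to P
conv_proj = nn.Conv2d(C, D, kernel_size=P, stride=P)
x = torch.randn(1, C, 224, 224)
patch_embeddings_conv = conv_proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
```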
Inspired by BERT's [CLS] token in NLP, ViT prepends a learnable embedding vector, $z_{\text{class}} \in \mathbb{R}^D$, to the sequence of patch embeddings. This class token has no direct correspondence to any specific image patch but acts as a global representation aggregator. It interacts with the patch embeddings throughout the Transformer encoder layers via the self-attention mechanism. The final output state of this class token, after passing through the entire encoder, is used as the aggregate image representation for classification.
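A minimal sketch of how such a class token can be created and prepended in PyTorch; the embedding dimension $D = 768$ and the zero initialization are assumptions made here, and actual initialization schemes vary by implementation:

```python
import torch
from torch import nn

D = 768
cls_token = nn.Parameter(torch.zeros(1, 1, D))    # learnable class token, shared across the batch

patch_embeddings = torch.randn(2, 196, D)         # (B, N, D) from the projection step
cls = cls_token.expand(patch_embeddings.size(0), -1, -1)    # replicate to (B, 1, D)
tokens = torch.cat([cls, patch_embeddings], dim=1)          # (B, N+1, D)
```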
The standard Transformer architecture is permutation-invariant; it does not inherently understand the order or spatial location of tokens in a sequence. However, the spatial arrangement of patches is obviously significant for image understanding. To incorporate this spatial information, ViT adds position embeddings to the patch embeddings (including the class token).
These are typically standard, learnable 1D position embeddings $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$. Each row of $E_{\text{pos}}$ corresponds to a specific position in the sequence (0 for the class token, 1 to $N$ for the patches), and the entire matrix is learned during training. The position embedding for each sequence element is added element-wise to the corresponding patch embedding (or class token embedding).
The resulting sequence of embedding vectors, which serves as the input to the Transformer encoder ($z_0$), is constructed as:
$$z_0 = [z_{\text{class}};\, z_p^1;\, z_p^2;\, \ldots;\, z_p^N] + E_{\text{pos}}$$
where $[\,;\,]$ denotes concatenation along the sequence dimension. Each element of the sequence $z_0$ is a vector of dimension $D$.
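A sketch of this addition in PyTorch, assuming $N = 196$ and $D = 768$ as before; the zero initialization of $E_{\text{pos}}$ is again an illustrative choice:

```python
import torch
from torch import nn

N, D = 196, 768
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # E_pos: one learnable row per sequence position

tokens = torch.randn(2, N + 1, D)    # [class token; patch embeddings], shape (B, N+1, D)
z0 = tokens + pos_embed              # element-wise addition, broadcast over the batch
```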
The core of the ViT is a stack of $L$ identical Transformer encoder blocks. These blocks are responsible for processing the sequence of embeddings, allowing information to flow between different patch representations via self-attention. Each block typically consists of two main sub-layers: a multi-head self-attention (MHSA) mechanism and a multi-layer perceptron (MLP), i.e. a position-wise feed-forward network.
Layer Normalization (LN) is applied before each sub-layer (MHSA and MLP), and residual connections are used around each sub-layer. The processing within a single encoder block $l$ (for $l$ from 1 to $L$) can be summarized as:

$$z'_l = \text{MHSA}(\text{LN}(z_{l-1})) + z_{l-1}$$
$$z_l = \text{MLP}(\text{LN}(z'_l)) + z'_l$$
The output of the final block, $z_L$, contains the processed representations for the class token and all patches.
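A sketch of one such pre-norm encoder block in PyTorch, using the ViT-Base hyperparameters ($D = 768$, 12 heads, MLP hidden size $4D$, $L = 12$) as assumed values; `nn.MultiheadAttention` stands in for the MHSA layer, and dropout is omitted for brevity:

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        # MHSA sub-layer: pre-LayerNorm, then residual connection
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]
        # MLP sub-layer: pre-LayerNorm, then residual connection
        z = z + self.mlp(self.norm2(z))
        return z

blocks = nn.Sequential(*[EncoderBlock() for _ in range(12)])   # L = 12 for ViT-Base
zL = blocks(torch.randn(2, 197, 768))                          # (B, N+1, D)
```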
Figure: Overall architecture of the Vision Transformer (ViT). The input image is divided into patches, linearly projected, augmented with position embeddings and a class token, processed by a stack of Transformer encoder layers, and finally classified using the output representation of the class token.
After processing through the $L$ Transformer encoder layers, the output state corresponding to the initial [class] token, denoted $z_L^0 \in \mathbb{R}^D$, is taken as the final image representation. This single vector is then fed into a classification head to produce the final class predictions. Typically, the head consists of a Layer Normalization step followed by a single linear layer that maps the $D$-dimensional representation to the number of output classes:
$$y = \text{Linear}(\text{LN}(z_L^0))$$
This approach contrasts with CNNs, which often use global average pooling over the final feature map before classification. While global pooling over the output patch embeddings is a possible alternative in ViT, using the dedicated class token is the standard method described in the original paper.
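A sketch of this head in PyTorch, assuming $D = 768$ and a hypothetical 1000-class task; the global-average-pooling alternative mentioned above is included for comparison:

```python
import torch
from torch import nn

D, num_classes = 768, 1000
head = nn.Sequential(nn.LayerNorm(D), nn.Linear(D, num_classes))

zL = torch.randn(2, 197, D)        # output of the final encoder block, (B, N+1, D)
cls_out = zL[:, 0]                 # z_L^0: final state of the class token
logits = head(cls_out)             # (B, num_classes)

# Alternative: global average pooling over the patch tokens instead of the class token
pooled = zL[:, 1:].mean(dim=1)     # (B, D)
logits_gap = head(pooled)
```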
In summary, the ViT architecture directly adapts the Transformer model for image recognition by converting images into sequences of embedded patches with position information. Its core strength lies in the Transformer encoder's ability to model global relationships between patches using self-attention, deviating significantly from the localized processing characteristic of CNNs. However, this lack of strong spatial inductive biases often means ViTs require larger datasets or extensive pre-training compared to CNNs to achieve comparable performance.