While convolutional layers in GANs excel at capturing local patterns, their fixed, local receptive fields limit their ability to model long-range dependencies across an image. For instance, generating a realistic animal might require ensuring consistency between the head orientation and the limb positions, features that can be far apart in pixel space. Standard convolutions struggle with this: information must pass through many layers to cover large distances, which dilutes the signal and adds computational cost without guaranteeing that these long-range relationships are modeled effectively.
Attention mechanisms provide a powerful alternative, allowing the network to selectively focus on relevant parts of the input, regardless of their spatial distance. In the context of GANs, particularly image generation, self-attention has proven highly effective.
The Self-Attention Generative Adversarial Network (SAGAN) introduced a self-attention module directly into the generator and discriminator architectures. This module allows the network to weigh the importance of features at all other positions when calculating the feature response at a given position.
Consider an input feature map $x \in \mathbb{R}^{C \times H \times W}$ from a preceding convolutional layer. The self-attention module computes three representations from this input, a query, a key, and a value:

$$q(x) = W_q x, \qquad k(x) = W_k x, \qquad v(x) = W_v x$$
Here, $W_q$, $W_k$, and $W_v$ are learnable weight matrices, typically implemented as $1 \times 1$ convolutions. The dimensions are often reduced for computational efficiency, for example $W_q, W_k \in \mathbb{R}^{\bar{C} \times C}$ and $W_v \in \mathbb{R}^{C \times C}$, where $\bar{C} = C/8$ is a common choice.
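As a concrete illustration, these projections can be written as $1 \times 1$ convolutions in code. The following is a minimal sketch assuming a PyTorch implementation; the channel count `C = 64`, the spatial size, and the layer names (`query_conv`, `key_conv`, `value_conv`) are illustrative choices, not prescribed by the architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three learned projections (assumed PyTorch).
# The 1x1 convolutions play the roles of W_q, W_k, and W_v; the channel
# reduction C_bar = C // 8 follows the common choice mentioned above.
C = 64                  # channels of the incoming feature map (illustrative)
C_bar = C // 8          # reduced channel dimension for queries and keys

query_conv = nn.Conv2d(C, C_bar, kernel_size=1)   # W_q
key_conv = nn.Conv2d(C, C_bar, kernel_size=1)     # W_k
value_conv = nn.Conv2d(C, C, kernel_size=1)       # W_v (keeps all C channels)

x = torch.randn(1, C, 16, 16)                     # example feature map x
q, k, v = query_conv(x), key_conv(x), value_conv(x)
print(q.shape, k.shape, v.shape)  # (1, 8, 16, 16), (1, 8, 16, 16), (1, 64, 16, 16)
```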
The core idea is to calculate attention weights based on the similarity between queries and keys. The queries and keys are reshaped into $\mathbb{R}^{\bar{C} \times N}$, where $N = H \times W$ is the number of feature map positions. The attention map $\beta \in \mathbb{R}^{N \times N}$ is computed as:
$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{k=1}^{N} \exp(s_{kj})}, \quad \text{where} \quad s_{ij} = q(x_i)^{T} k(x_j)$$

This equation calculates the dot-product similarity $s_{ij}$ between the query at position $i$ and the key at position $j$. A softmax then normalizes these similarities into attention weights $\beta_{j,i}$, so that for each output position $j$ the weights over all positions $i$ sum to 1; $\beta_{j,i}$ represents how much the model should attend to position $i$ when synthesizing the output at position $j$.
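In code, this amounts to flattening the spatial dimensions and applying a batched matrix product followed by a softmax. The sketch below again assumes PyTorch, with random tensors standing in for the reshaped queries and keys; the shapes mirror the notation above.

```python
import torch
import torch.nn.functional as F

# Sketch of the attention-map computation (assumed PyTorch). Random tensors
# stand in for the reshaped queries and keys.
B, C_bar, H, W = 1, 8, 16, 16
N = H * W
q = torch.randn(B, C_bar, N)             # queries reshaped to (B, C_bar, N)
k = torch.randn(B, C_bar, N)             # keys reshaped to (B, C_bar, N)

# s[b, i, j] = q(x_i)^T k(x_j): similarity between positions i and j
s = torch.bmm(q.transpose(1, 2), k)      # shape (B, N, N)

# Normalize over the source positions i, so that beta[b, i, j] = beta_{j,i}
# and the weights for each output position j sum to 1.
beta = F.softmax(s, dim=1)
print(beta.sum(dim=1)[0, :3])            # approximately 1.0 for each position j
```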
The value representation $v(x)$ is reshaped to $\mathbb{R}^{C \times N}$. The output of the attention layer, $o \in \mathbb{R}^{C \times N}$, is then computed as a weighted sum of the value features, using the attention map:
$$o_j = \sum_{i=1}^{N} \beta_{j,i} \, v(x_i)$$

This output $o$ is reshaped back to $\mathbb{R}^{C \times H \times W}$, as sketched below.
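The weighted sum itself is one more batched matrix product. Another PyTorch sketch with stand-in tensors, following the shape conventions above:

```python
import torch

# Sketch of applying the attention map to the value features (assumed PyTorch).
B, C, H, W = 1, 64, 16, 16
N = H * W
beta = torch.softmax(torch.randn(B, N, N), dim=1)  # stand-in attention map, beta[b, i, j] = beta_{j,i}
v = torch.randn(B, C, N)                           # value features v(x) reshaped to (B, C, N)

o = torch.bmm(v, beta)       # o[:, :, j] = sum_i beta_{j,i} * v(x_i), shape (B, C, N)
o = o.view(B, C, H, W)       # reshape back to the spatial layout of the input
```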
Finally, the module typically adds this attention-derived output back to the original input feature map $x$, scaled by a learnable parameter $\gamma$ initialized to 0:

$$y = \gamma \, o + x$$

Initializing $\gamma$ to 0 allows the network to rely on local convolutional features at the start of training and gradually learn to incorporate non-local information as needed.
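Putting these pieces together, a complete module might look like the following. This is a sketch assuming PyTorch, not the reference SAGAN code; the class name `SelfAttention`, the `reduction=8` argument, and the layer names are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Self-attention block following the equations above (illustrative sketch,
    not the reference SAGAN implementation)."""

    def __init__(self, in_channels, reduction=8):
        super().__init__()
        reduced = max(in_channels // reduction, 1)
        self.query_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)      # W_q
        self.key_conv = nn.Conv2d(in_channels, reduced, kernel_size=1)        # W_k
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)  # W_v
        self.gamma = nn.Parameter(torch.zeros(1))                             # gamma initialized to 0

    def forward(self, x):
        B, C, H, W = x.shape
        N = H * W
        q = self.query_conv(x).view(B, -1, N)      # (B, C_bar, N)
        k = self.key_conv(x).view(B, -1, N)        # (B, C_bar, N)
        v = self.value_conv(x).view(B, C, N)       # (B, C, N)

        s = torch.bmm(q.transpose(1, 2), k)        # s[b, i, j] = q(x_i)^T k(x_j)
        beta = F.softmax(s, dim=1)                 # beta[b, i, j] = beta_{j,i}

        o = torch.bmm(v, beta).view(B, C, H, W)    # o_j = sum_i beta_{j,i} v(x_i)
        return self.gamma * o + x                  # y = gamma * o + x

# Quick shape check: the output matches the input feature map.
attn = SelfAttention(64)
y = attn(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```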
Diagram illustrating the data flow within a self-attention module applied to a feature map. Input features are transformed into Query, Key, and Value representations. Attention weights are computed via matrix multiplication and softmax, then applied to the Value features. The result is scaled and added back to the original input.
In SAGAN, these self-attention modules are inserted at intermediate layers of both the generator and the discriminator, where the feature maps retain enough spatial resolution for long-range interactions to matter while keeping the $N \times N$ attention map computationally manageable.
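For example, such a module could sit between the upsampling blocks of a generator. The sketch below assumes the `SelfAttention` class from the previous example and uses arbitrary DCGAN-style layer sizes purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative placement only: a simplified DCGAN-style generator with the
# SelfAttention block (from the previous sketch) inserted between intermediate
# upsampling stages.
generator = nn.Sequential(
    nn.ConvTranspose2d(128, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 1x1   -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4   -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 8x8   -> 16x16
    SelfAttention(64),                      # attend over the 16x16 feature map
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 16x16 -> 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # 32x32 -> 64x64 RGB
)

fake_images = generator(torch.randn(4, 128, 1, 1))
print(fake_images.shape)  # torch.Size([4, 3, 64, 64])
```

A mirrored placement works on the discriminator side: the same block can be dropped in after a downsampling convolution at a comparable feature-map resolution.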
By incorporating attention mechanisms, advanced GAN architectures like SAGAN overcome some fundamental limitations of purely convolutional approaches, enabling the generation of higher-fidelity images with better global structure and consistency. This technique represents a significant step towards capturing the complex, long-range correlations present in real-world data.