Standard convolutional layers operate locally, processing information within a confined neighborhood defined by the kernel size. While stacking these layers increases the effective receptive field, efficiently capturing dependencies between spatially distant features (long-range dependencies) remains challenging. For instance, understanding the relationship between a person holding an object and the object itself, even if they appear far apart in the image, requires modeling interactions beyond immediate adjacency.
Non-local Neural Networks directly address this limitation by computing the response at a position as a weighted sum of features at all positions in the input feature map. This allows the network to capture global context and model relationships between any two locations, regardless of their spatial distance. Think of it as a generalization of self-attention mechanisms applied to spatial or spatio-temporal data.
The core idea is elegantly captured in a general formula for a non-local operation. Given an input feature map $x$ (which could be an image, or feature maps from intermediate layers of a CNN), the output feature map $y$ at a position $i$ is computed as:
$$
y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)
$$

Let's break down the components:

- $y_i$: the output feature at position $i$.
- $x_i, x_j$: the input features at positions $i$ and $j$; the sum runs over all positions $j$.
- $f(x_i, x_j)$: a pairwise function producing a scalar that reflects the relationship between positions $i$ and $j$.
- $g(x_j)$: a unary function computing a representation of the input at position $j$.
- $C(x)$: a normalization factor.
Essentially, the response at position $i$ ($y_i$) is a weighted average of the transformed features ($g(x_j)$) from all positions $j$. The weights are determined by the similarity or relationship ($f$) between position $i$ and each position $j$, normalized by $C(x)$.
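To make the formula concrete, here is a minimal sketch of the generic operation in PyTorch for a single flattened feature map; the function name and shapes are illustrative, and $f$ and $g$ are passed in as plain callables so any of the variants discussed below can be plugged in:

```python
import torch

def non_local(x, f, g):
    """Generic non-local operation over a flattened feature map.

    x: (N, C) tensor -- N positions, C channels.
    f: pairwise function returning scalar affinities f(x_i, x_j).
    g: unary transform applied to every position's features.
    """
    gx = g(x)                                     # (N, C'): g(x_j)
    # All-pairs affinities f(x_i, x_j) via broadcasting: (N, 1, C) vs (1, N, C).
    affinity = f(x.unsqueeze(1), x.unsqueeze(0))  # (N, N)
    C_x = affinity.sum(dim=1, keepdim=True)       # normalization C(x)
    return (affinity / C_x) @ gx                  # weighted sum over j: (N, C')

# Example: Gaussian affinity f(x_i, x_j) = exp(x_i^T x_j), identity g.
x = torch.randn(16, 8)                            # 16 positions, 8 channels
y = non_local(x, lambda a, b: torch.exp((a * b).sum(-1)), lambda t: t)
```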
The generic non-local operation formula allows for different specific implementations depending on the choice of the functions $f$ and $g$.
Transformation Function $g$: A common choice for $g$ is a simple linear embedding learned via a $1\times 1$ convolution: $g(x_j) = W_g x_j$. Here, $W_g$ represents the weights of the $1\times 1$ convolutional layer.
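In code, this embedding is just a pointwise convolution; a two-line sketch (the channel counts are illustrative):

```python
import torch
from torch import nn

g = nn.Conv2d(256, 128, kernel_size=1)  # g(x_j) = W_g x_j at every position
gx = g(torch.randn(1, 256, 14, 14))     # output shape: (1, 128, 14, 14)
```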
Pairwise Function $f$: Several options exist for the pairwise function $f$, which measures the relationship between positions $i$ and $j$:
Embedded Gaussian: This is perhaps the most common choice and directly relates to self-attention. First, linear embeddings $\theta$ and $\phi$ (again, typically $1\times 1$ convolutions with weight matrices $W_\theta$ and $W_\phi$) are applied to the input features. The function is then: $f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} = e^{(W_\theta x_i)^T (W_\phi x_j)}$. Setting the normalization factor to $C(x) = \sum_{\forall j} f(x_i, x_j)$ makes the weighting $\frac{f(x_i, x_j)}{C(x)}$ equivalent to a softmax function applied over all positions $j$.
Dot Product: A simpler version of the Embedded Gaussian, omitting the exponential: $f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$. Here, the normalization is typically $C(x) = N$, the number of positions.
Concatenation: The embedded features are concatenated, passed through a linear layer (with weight vector $w_f$), and activated (e.g., with ReLU): $f(x_i, x_j) = \mathrm{ReLU}\!\left(w_f^T [\theta(x_i), \phi(x_j)]\right)$. This allows for potentially more complex relationship modeling.
The Embedded Gaussian approach is widely used due to its connection to attention mechanisms and its empirical success; the sketch below shows how its weighting reduces to a softmax.
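As a quick sanity check on the softmax equivalence, here is a small sketch (shapes and names are illustrative, with `nn.Linear` standing in for the $1\times 1$ convolutions) that computes the Embedded Gaussian weights explicitly and compares them against `torch.softmax`:

```python
import torch
from torch import nn

N, C, C_inter = 64, 32, 16                  # positions, channels, embedding dim
x = torch.randn(N, C)
theta = nn.Linear(C, C_inter, bias=False)   # stands in for a 1x1 conv
phi = nn.Linear(C, C_inter, bias=False)

logits = theta(x) @ phi(x).T                # theta(x_i)^T phi(x_j), shape (N, N)

# Explicit Embedded Gaussian: f / C(x) with C(x) = sum_j f(x_i, x_j).
f = torch.exp(logits)
weights_explicit = f / f.sum(dim=1, keepdim=True)

# The same weights via softmax over the j dimension.
weights_softmax = torch.softmax(logits, dim=1)

assert torch.allclose(weights_explicit, weights_softmax, atol=1e-6)
```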
Non-local operations are typically integrated into existing deep learning architectures as "Non-local Blocks". These blocks often employ a residual connection to facilitate training, similar to ResNet blocks.
The structure of a common Non-local Block (using the Embedded Gaussian version) is as follows: the input $x$ is projected by three $1\times 1$ convolutions ($\theta$, $\phi$, and $g$); a softmax over the affinities $\theta(x_i)^T \phi(x_j)$ produces the attention weights, which form a weighted sum of the $g$ features; a final $1\times 1$ convolution $W_z$ restores the channel dimension; and the result is added back to the input, $z = W_z y + x$.
This structure allows the network to learn long-range dependencies while retaining the original information flow through the residual connection.
Figure: Data flow within a Non-local Block using Embedded Gaussian affinity and a residual connection. The $\theta$, $\phi$, and $g$ transformations are typically implemented using $1\times 1$ convolutions.
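Putting these pieces together, below is a compact PyTorch sketch of an Embedded Gaussian non-local block. It follows the structure just described but is an illustrative implementation, not the authors' reference code; the half-channel bottleneck and zero-initialized $W_z$ are common practices from the original paper:

```python
import torch
from torch import nn

class NonLocalBlock2D(nn.Module):
    """Embedded Gaussian non-local block with a residual connection (sketch)."""

    def __init__(self, channels, inter_channels=None):
        super().__init__()
        # Common practice: bottleneck the block to half the input channels.
        self.inter = inter_channels or channels // 2
        self.theta = nn.Conv2d(channels, self.inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, self.inter, kernel_size=1)
        self.g = nn.Conv2d(channels, self.inter, kernel_size=1)
        self.w_z = nn.Conv2d(self.inter, channels, kernel_size=1)
        # Zero-initializing W_z makes the block start as an identity mapping.
        nn.init.zeros_(self.w_z.weight)
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).flatten(2)                    # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, N, C')

        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N) affinities
        y = attn @ v                                  # (B, N, C') weighted sum
        y = y.transpose(1, 2).reshape(b, self.inter, h, w)
        return x + self.w_z(y)                        # residual connection

block = NonLocalBlock2D(channels=64)
out = block(torch.randn(2, 64, 14, 14))               # same shape as the input
```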
If you are familiar with the Transformer architecture, you'll recognize the Embedded Gaussian non-local operation as dot-product self-attention; it matches the Transformer's scaled dot-product attention up to the $1/\sqrt{d_k}$ scaling factor.
The $\theta$, $\phi$, and $g$ embeddings play the roles of queries, keys, and values: the non-local block computes attention weights between each query position $i$ and all key positions $j$, then uses these weights to form a weighted sum of the values. Non-local networks essentially introduced self-attention to the computer vision domain, applying it directly to feature maps rather than to sequences of word embeddings.
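To make the correspondence explicit, the sketch below flattens a feature map into a sequence and checks that the non-local weighting matches PyTorch's `scaled_dot_product_attention` once the $1/\sqrt{d_k}$ factor is disabled. The `scale` argument assumes a recent PyTorch release, and the identity projections are a simplification for brevity:

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 16, 8, 8
feat = torch.randn(B, C, H, W)

# Flatten spatial positions into a "sequence" of tokens: (B, H*W, C).
seq = feat.flatten(2).transpose(1, 2)

# theta/phi/g are taken as identity here, so queries, keys,
# and values are all the flattened features.
q = k = v = seq

# Non-local (Embedded Gaussian) weighting, written out explicitly.
y_nonlocal = torch.softmax(q @ k.transpose(1, 2), dim=-1) @ v

# Transformer-style attention; scale=1.0 removes the 1/sqrt(d_k) factor.
y_attn = F.scaled_dot_product_attention(q, k, v, scale=1.0)

assert torch.allclose(y_nonlocal, y_attn, atol=1e-5)
```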
Non-local blocks have proven effective in tasks where long-range interactions are significant, such as video classification, object detection and instance segmentation, and keypoint (pose) estimation.
However, a significant consideration is computational cost. Calculating the pairwise affinities involves comparing every position with every other position. If the input feature map has $N = H \times W$ spatial locations, the complexity is $O(N^2)$: for a modest $64 \times 64$ map, $N = 4096$ and the affinity matrix already holds roughly 16.8 million entries, which can be demanding for high-resolution feature maps.
Mitigation Strategies: To manage this cost, implementations often:

- reduce the channel dimension inside the block (e.g., a bottleneck with half the input channels for $\theta$, $\phi$, and $g$);
- subsample the key and value positions, for instance with max pooling after $\phi$ and $g$, which shrinks the affinity matrix without changing the output resolution;
- insert non-local blocks only at later stages of the network, where feature maps are smaller.

A pooled variant is sketched below.
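As an illustration of the subsampling strategy, this hypothetical variant of the block above pools the $\phi$ and $g$ paths before the affinity computation, cutting the pairwise work by the square of the pooling factor:

```python
import torch
from torch import nn

class SubsampledNonLocalBlock2D(nn.Module):
    """Non-local block that pools keys/values to reduce the O(N^2) cost (sketch)."""

    def __init__(self, channels, pool=2):
        super().__init__()
        self.inter = channels // 2
        self.theta = nn.Conv2d(channels, self.inter, kernel_size=1)
        self.phi = nn.Sequential(
            nn.Conv2d(channels, self.inter, kernel_size=1),
            nn.MaxPool2d(pool),   # subsample the key positions
        )
        self.g = nn.Sequential(
            nn.Conv2d(channels, self.inter, kernel_size=1),
            nn.MaxPool2d(pool),   # subsample the value positions to match
        )
        self.w_z = nn.Conv2d(self.inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).flatten(2)                    # (B, C', N / pool^2)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, N / pool^2, C')
        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N / pool^2)
        y = (attn @ v).transpose(1, 2).reshape(b, self.inter, h, w)
        return x + self.w_z(y)

block = SubsampledNonLocalBlock2D(channels=64, pool=2)
out = block(torch.randn(2, 64, 16, 16))  # affinity matrix is 256 x 64, not 256 x 256
```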
Non-local networks provide a powerful and general mechanism for incorporating non-local interactions within deep learning models for vision. They represent an important step towards models that can reason about global image structure and context, complementing the local feature extraction capabilities of standard convolutions and paving the way for architectures like Vision Transformers.