Standard convolutional layers operate locally, processing information within a confined neighborhood defined by the kernel size. While stacking these layers increases the effective receptive field, efficiently capturing dependencies between spatially distant features (long-range dependencies) remains challenging. For instance, understanding the relationship between a person holding an object and the object itself, even if they appear far apart in the image, requires modeling interactions outside immediate adjacency.
Non-local Neural Networks directly address this limitation by computing the response at a position as a weighted sum of features at all positions in the input feature map. This allows the network to capture global context and model relationships between any two locations, regardless of their spatial distance. Think of it as a generalization of self-attention mechanisms applied to spatial or spatio-temporal data.
The Non-local Operation
The core idea is elegantly captured in a general formula for a non-local operation. Given an input feature map x (which could be an image, or feature maps from intermediate layers of a CNN), the output feature map y at a position i is computed as:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
Let's break down the components:
- i: The index of the output position whose response is being calculated.
- j: An index that iterates over all possible positions in the input feature map.
- x_i: The feature vector at position i in the input.
- x_j: The feature vector at position j in the input.
- f(x_i, x_j): A pairwise function that computes a scalar representing the relationship (e.g., affinity or similarity) between position i and position j.
- g(x_j): A function that computes a representation or transformation of the input feature vector at position j. This is often a linear embedding.
- C(x): A normalization factor, calculated as the sum of f(x_i, x_j) over all positions j, ensuring the weights sum appropriately (often to 1, similar to a softmax).
Essentially, the response at position i (y_i) is a weighted average of the transformed features g(x_j) from all positions j. The weights are determined by the similarity or relationship f(x_i, x_j) between position i and each position j.
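To make the formula concrete, here is a minimal NumPy sketch of the generic operation, written with explicit loops for readability rather than efficiency. The names `non_local_generic`, `pairwise_f`, and `transform_g` are placeholders standing in for the operation and the functions f and g.

```python
import numpy as np

def non_local_generic(x, pairwise_f, transform_g):
    """Naive sketch of y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j).

    x           -- array of shape (N, C): N flattened positions, C channels.
    pairwise_f  -- callable (x_i, x_j) -> scalar affinity f.
    transform_g -- callable (x_j)      -> transformed feature vector g(x_j).
    """
    n = x.shape[0]
    g_x = np.stack([transform_g(x[j]) for j in range(n)])            # g(x_j) for every j
    y = np.zeros_like(g_x)
    for i in range(n):
        f_ij = np.array([pairwise_f(x[i], x[j]) for j in range(n)])  # f(x_i, x_j) for every j
        y[i] = (f_ij / f_ij.sum()) @ g_x                             # divide by C(x) = sum_j f(x_i, x_j)
    return y

# Example: embedded-Gaussian-style affinity with identity embeddings and identity g.
x = np.random.randn(6, 4)
y = non_local_generic(x, pairwise_f=lambda a, b: np.exp(a @ b), transform_g=lambda v: v)
```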
Instantiations of the Pairwise and Transformation Functions
The generic non-local operation formula allows for different specific implementations based on the choice of functions f and g.
Transformation Function g:
A common choice for g is a simple linear embedding learned via a 1×1 convolution:
$$g(x_j) = W_g x_j$$

Here, W_g represents the weights of the 1×1 convolutional layer.
Pairwise Function f:
Several options exist for the pairwise function f, measuring the relationship between i and j:
- Embedded Gaussian: This is perhaps the most common choice and directly relates to self-attention. First, linear embeddings θ and ϕ (again, typically 1×1 convolutions with weight matrices W_θ and W_ϕ) are applied to the input features. The function is then:

  $$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} = e^{(W_\theta x_i)^T (W_\phi x_j)}$$

  The normalization factor C(x) in this case makes the weighting f(x_i, x_j)/C(x) equivalent to a softmax function applied over all positions j.

- Dot Product: A simpler version of the Embedded Gaussian, omitting the exponential:

  $$f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$$

  Here, the normalization factor C(x) is typically set to the number of positions N.

- Concatenation: The embedded features are concatenated, passed through a linear layer (with weight vector w_f), and activated (e.g., with ReLU):

  $$f(x_i, x_j) = \mathrm{ReLU}\left(w_f^T \left[\theta(x_i), \phi(x_j)\right]\right)$$

  This allows for potentially more complex relationship modeling.
The Embedded Gaussian approach is widely used due to its connection to attention mechanisms and empirical success.
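As a rough illustration, the three choices of f can be written as small Python functions acting on already-embedded vectors. The argument names `theta_i`, `phi_j`, and `w_f` are illustrative placeholders for θ(x_i), ϕ(x_j), and the learned weight vector.

```python
import numpy as np

def f_embedded_gaussian(theta_i, phi_j):
    # f(x_i, x_j) = exp(θ(x_i)ᵀ ϕ(x_j)); normalizing by C(x) = Σ_j f turns the
    # weights into a softmax over the positions j.
    return np.exp(theta_i @ phi_j)

def f_dot_product(theta_i, phi_j):
    # f(x_i, x_j) = θ(x_i)ᵀ ϕ(x_j); normalized by C(x) = N, the number of positions.
    return theta_i @ phi_j

def f_concatenation(theta_i, phi_j, w_f):
    # f(x_i, x_j) = ReLU(w_fᵀ [θ(x_i), ϕ(x_j)]); w_f maps the concatenated pair to a scalar.
    return max(0.0, w_f @ np.concatenate([theta_i, phi_j]))
```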
Implementing Non-local Blocks
Non-local operations are typically integrated into existing deep learning architectures as "Non-local Blocks". These blocks often employ a residual connection to facilitate training, similar to ResNet blocks.
The structure of a common Non-local Block (using the Embedded Gaussian version) is as follows:
- Input: Feature map x (e.g., dimensions H×W×C).
- Embeddings: Compute θ(x), ϕ(x), and g(x) using separate 1×1 convolutions. θ and ϕ typically reduce the channel dimension (e.g., to C/2) to save computation, while g might maintain or reduce it. Let the outputs be Θ, Φ, G.
- Affinity Calculation: Reshape Θ and Φ into matrices of size (HW)×(C/2), where rows index spatial locations and columns index channels. Compute the matrix product Θ Φ^T. This gives a matrix of size (HW)×(HW) representing affinities between all pairs of locations.
- Normalization: Apply softmax row-wise (or column-wise, depending on matrix setup) to the affinity matrix to get normalized weights A.
- Weighted Sum: Reshape G similarly to Θ and Φ (dimension HW×C′). Compute the weighted sum by matrix multiplying the normalized affinities A with G. Let the result be Y′.
- Reshape and Project: Reshape Y′ back to the spatial dimensions H×W×C′. Apply a final 1×1 convolution (with weights W_z) to project it back to the original channel dimension C. Let this be Y. The weights W_z are often initialized to zero, so the block initially behaves like an identity function.
- Residual Connection: Add the output Y back to the original input x: z = Y + x.
This structure allows the network to learn long-range dependencies while retaining the original information flow through the residual connection.
Figure: Data flow within a Non-local Block using Embedded Gaussian affinity and a residual connection. The θ, ϕ, and g transformations are typically implemented using 1×1 convolutions.
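The steps above translate fairly directly into a PyTorch module. The following is only a sketch under a few assumptions (the class name `NonLocalBlock2D`, halving the channel dimension to C/2, and a batched 2D input), not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block with a residual connection (sketch)."""

    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                                    # reduced channel dimension C/2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)   # θ embedding (query)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)     # ϕ embedding (key)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)       # g embedding (value)
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)     # W_z: project back to C channels
        nn.init.zeros_(self.w_z.weight)                          # zero init -> block starts as identity
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        theta = self.theta(x).view(b, -1, n).permute(0, 2, 1)    # (B, N, C/2)
        phi = self.phi(x).view(b, -1, n)                         # (B, C/2, N)
        g = self.g(x).view(b, -1, n).permute(0, 2, 1)            # (B, N, C/2)

        affinity = torch.bmm(theta, phi)                         # (B, N, N) pairwise affinities
        attn = F.softmax(affinity, dim=-1)                       # normalize over all positions j

        y = torch.bmm(attn, g)                                   # (B, N, C/2) weighted sum of values
        y = y.permute(0, 2, 1).reshape(b, -1, h, w)              # back to (B, C/2, H, W)
        return x + self.w_z(y)                                   # residual connection: z = W_z(y) + x

# Usage: drop the block into an existing CNN stage; the output shape matches the input.
block = NonLocalBlock2D(channels=256)
out = block(torch.randn(2, 256, 14, 14))
```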
Connection to Self-Attention
If you are familiar with the Transformer architecture, you'll recognize the Embedded Gaussian non-local operation as being equivalent to the scaled dot-product self-attention mechanism.
- θ(x_i) corresponds to the "Query".
- ϕ(x_j) corresponds to the "Key".
- g(x_j) corresponds to the "Value".
The non-local block computes the attention weights between each query position i and all positions j, then uses these weights to form a weighted sum of the values. Non-local networks essentially introduced self-attention to the computer vision domain, applying it directly to feature maps rather than sequences of word embeddings.
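The correspondence can be written out in a few lines of Python; the tensor names and shapes below are purely illustrative. One difference worth noting: the original embedded Gaussian formulation does not scale the logits by 1/√d, whereas Transformer-style scaled dot-product attention does.

```python
import torch
import torch.nn.functional as F

# One sample with N positions and d embedding channels (illustrative shapes only).
N, d = 196, 128
q = torch.randn(N, d)   # θ(x): queries
k = torch.randn(N, d)   # ϕ(x): keys
v = torch.randn(N, d)   # g(x): values

# Embedded Gaussian non-local operation == softmax(Q Kᵀ) V over positions.
# (Transformer attention would additionally divide the logits by sqrt(d).)
weights = F.softmax(q @ k.t(), dim=-1)   # (N, N) normalized affinities over j
y = weights @ v                          # (N, d) responses
```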
Applications and Approaches
Non-local blocks have proven effective in tasks where long-range interactions are significant:
- Video Analysis: Capturing temporal dependencies between frames for action recognition.
- Object Detection & Segmentation: Modeling context, such as relationships between objects or between object parts and the whole object. Adding non-local blocks to a Mask R-CNN backbone, for example, improves both detection and instance segmentation accuracy.
- Image Generation: Improving the global coherence and structure of generated images.
However, a significant consideration is computational cost. Calculating the pairwise affinities involves comparing every position with every other position. If the input feature map has N=H×W spatial locations, the complexity is O(N2), which can be demanding for high-resolution feature maps.
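For example, a 64×64 feature map has N = 4096 positions, so the affinity matrix alone holds 4096² ≈ 16.8 million entries per sample, for every non-local block in the network.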
Mitigation Strategies:
To manage this cost, implementations often:
- Reduce channel dimensions via the 1×1 convolutions in θ, ϕ, and g.
- Apply spatial subsampling (e.g., max-pooling) to the key (ϕ) and value (g) inputs before the pairwise computations. This reduces the number of positions j involved in the sum (see the sketch after this list).
- Insert Non-local blocks only in deeper layers of the network where feature maps are smaller, or sparingly throughout the architecture.
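As a sketch of the subsampling strategy, the variant below reuses the hypothetical `NonLocalBlock2D` module from the earlier sketch and max-pools the ϕ and g branches, so each query position attends to a quarter as many positions j.

```python
import torch
import torch.nn.functional as F

class SubsampledNonLocalBlock2D(NonLocalBlock2D):
    """Non-local block that max-pools the key (ϕ) and value (g) branches (sketch)."""

    def forward(self, x):
        b, c, h, w = x.shape                                          # assumes even H and W
        theta = self.theta(x).view(b, -1, h * w).permute(0, 2, 1)     # (B, N, C/2)

        # Pool ϕ and g spatially so only N/4 positions j enter the pairwise sum.
        phi = F.max_pool2d(self.phi(x), kernel_size=2)                # (B, C/2, H/2, W/2)
        g = F.max_pool2d(self.g(x), kernel_size=2)
        phi = phi.view(b, phi.shape[1], -1)                           # (B, C/2, N/4)
        g = g.view(b, g.shape[1], -1).permute(0, 2, 1)                # (B, N/4, C/2)

        attn = F.softmax(torch.bmm(theta, phi), dim=-1)               # (B, N, N/4)
        y = torch.bmm(attn, g).permute(0, 2, 1).reshape(b, -1, h, w)  # (B, C/2, H, W)
        return x + self.w_z(y)                                        # residual connection
```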
Non-local networks provide a powerful and general mechanism for incorporating non-local interactions within deep learning models for vision. They represent an important step towards models that can reason about global image structure and context, complementing the local feature extraction capabilities of standard convolutions and foreshadowing later attention-based architectures such as Vision Transformers.