A core challenge in graph neural networks is how to effectively aggregate information from a node's neighborhood. Many existing approaches, such as Graph Convolutional Networks (GCNs) and GraphSAGE, treat each neighbor with a fixed or uniform importance. For example, a GCN layer assigns weights based on node degrees, while a GraphSAGE layer with mean aggregation treats every neighbor equally. This raises a significant question: are all neighbors equally important for defining a central node's role or properties?
In many graphs, the answer is no. For instance, in a citation network, a citation from a seminal paper should likely carry more weight than a citation from an obscure workshop article. Graph Attention Networks (GATs) address this by introducing a mechanism that allows the model to learn the relative importance of different neighbors. Instead of using fixed aggregation weights, GATs compute attention coefficients for each edge, effectively learning how much "attention" a node should pay to each of its neighbors during the aggregation process.
The core of a GAT layer is a self-attention mechanism applied directly to the graph structure. This process computes the updated features for each node by attending over its neighbors. The operation is broken down into a few distinct steps.
First, as in other GNNs, a shared linear transformation, parameterized by a weight matrix $\mathbf{W}$, is applied to every node's feature vector $\mathbf{h}_i$, producing transformed features $\mathbf{W}\mathbf{h}_i$. This projects the features into a potentially different dimensional space where the model can better learn discriminative properties.
Next, for each edge from a neighbor $j$ to a target node $i$, the model computes a raw, un-normalized attention score $e_{ij}$. This score indicates the importance of node $j$'s features to node $i$. It is typically calculated by a simple single-layer feed-forward network, parameterized by a weight vector $\vec{a}$, which takes the concatenated transformed feature vectors of the two nodes as input:

$$e_{ij} = \text{LeakyReLU}\left(\vec{a}^{\,T}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)$$

Here, $\Vert$ denotes concatenation. The LeakyReLU activation function introduces non-linearity. This mechanism is shared across all edges in the graph, meaning the model learns a single, universal function for calculating attention.
These raw scores are not directly comparable across different neighborhoods. To address this, we normalize them using the softmax function across all of a node's neighbors $\mathcal{N}_i$. This converts the raw scores into a probability distribution of attention coefficients $\alpha_{ij}$:

$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$

The resulting coefficient $\alpha_{ij}$ represents the learned importance of neighbor $j$ to node $i$.
Finally, the updated feature vector for node $i$, denoted $\mathbf{h}_i'$, is computed as a weighted sum of its neighbors' transformed features, using the attention coefficients as weights. An activation function $\sigma$ (such as ReLU) is typically applied to the result:

$$\mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j\right)$$
This entire process constitutes one GAT layer. By learning the attention weights $\alpha_{ij}$, the model can dynamically adjust the influence of each neighbor, a significant step up in expressive power compared to the static aggregation of GCNs.
The GAT layer computes attention coefficients ($\alpha_{ij}$) for each incoming edge to a target node. These coefficients determine the weight of each neighbor's contribution to the target node's updated representation; in the diagram, one neighbor receives a visibly higher attention weight than the others.
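To make these steps concrete, here is a minimal sketch of a single-head GAT layer in PyTorch. It operates on a dense adjacency matrix so the code lines up with the equations above; the class and variable names are illustrative rather than taken from any particular library, and production implementations work with sparse edge lists instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadGATLayer(nn.Module):
    """Illustrative single-head GAT layer over a dense adjacency matrix."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Shared linear transformation W applied to every node's features.
        self.W = nn.Linear(in_features, out_features, bias=False)
        # Attention vector a, applied to the concatenation [Wh_i || Wh_j].
        self.a = nn.Linear(2 * out_features, 1, bias=False)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.2)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   [N, in_features] node features
        # adj: [N, N] adjacency matrix, assumed to include self-loops
        Wh = self.W(h)                                   # [N, F']
        N = Wh.size(0)

        # All pairwise concatenations [Wh_i || Wh_j] -> [N, N, 2F'].
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)

        # Restrict the softmax to each node's neighborhood by masking non-edges.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(e, dim=1)                      # attention coefficients

        # Weighted sum of transformed neighbor features, then a non-linearity.
        return F.elu(torch.matmul(alpha, Wh))            # [N, F']
```

The dense formulation materializes an $N \times N$ score matrix purely for readability; it mirrors the equations but would not scale to large graphs.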
To make the learning process more stable and to allow the model to capture different types of relationships, GATs employ a multi-head attention mechanism. This is similar to how convolutional neural networks use multiple filters to capture different features (e.g., vertical edges, horizontal edges, colors).
In multi-head attention, several independent attention mechanisms, or "heads," execute the attention computation in parallel. Each head has its own set of parameters ($\mathbf{W}^k$ and $\vec{a}^k$ for the $k$-th head) and computes its own set of attention coefficients $\alpha_{ij}^k$.
Each head produces an embedding, and these embeddings are then combined to form the final output. For intermediate layers, the outputs are typically concatenated. With $K$ attention heads, the formula becomes:

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k\, \mathbf{W}^k \mathbf{h}_j\right)$$

where $\Vert$ again denotes concatenation. This results in an output feature vector that is $K$ times larger than the output of a single head.
For the final layer of the network, concatenation is no longer sensible, since the output dimensionality must match the prediction target. Instead, the outputs of the $K$ heads are usually averaged before applying the final activation function:

$$\mathbf{h}_i' = \sigma\left(\frac{1}{K}\sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^k\, \mathbf{W}^k \mathbf{h}_j\right)$$
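The difference between the two combination strategies is easiest to see in terms of tensor shapes. A small illustrative snippet (the dimensions here are arbitrary):

```python
import torch

# Suppose each of K attention heads has produced an [N, F'] output for N nodes.
K, N, F_out = 4, 6, 8
head_outputs = [torch.randn(N, F_out) for _ in range(K)]

# Intermediate layers: concatenate along the feature dimension -> [N, K * F'].
h_concat = torch.cat(head_outputs, dim=1)

# Final layer: average the heads before the activation -> [N, F'].
h_avg = torch.stack(head_outputs, dim=0).mean(dim=0)

print(h_concat.shape)  # torch.Size([6, 32])
print(h_avg.shape)     # torch.Size([6, 8])
```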
Using multiple heads helps the model learn a richer set of features, as each head can focus on a different aspect of the neighborhood's structure and feature space.
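In practice, libraries provide ready-made GAT layers that expose these choices as parameters. For example, with PyTorch Geometric's `GATConv`, a minimal usage sketch with randomly generated data (all layer sizes and tensor shapes here are arbitrary):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

# Hidden layer: 8 heads, outputs concatenated -> 8 * 8 = 64 features per node.
conv1 = GATConv(in_channels=16, out_channels=8, heads=8, concat=True)
# Output layer: 8 heads averaged -> 7 features per node (e.g. class scores).
conv2 = GATConv(in_channels=64, out_channels=7, heads=8, concat=False)

x = torch.randn(100, 16)                      # 100 nodes, 16 input features
edge_index = torch.randint(0, 100, (2, 400))  # 400 random edges, shape [2, E]

h = F.elu(conv1(x, edge_index))
out = conv2(h, edge_index)
print(out.shape)  # torch.Size([100, 7])
```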
Graph Attention Networks have several beneficial properties:

- The attention computation can be parallelized across all edges, and the aggregation across all nodes, so the operation remains efficient.
- Assigning different learned importances to neighbors increases model capacity and can aid interpretability, since the attention coefficients can be inspected.
- Because the attention mechanism is shared across edges, a trained model applies directly to inductive settings, including graphs never seen during training.
- Unlike spectral approaches, GATs do not depend on a fixed, known graph structure or on costly operations such as eigendecomposition.
The primary trade-off is computational cost. Calculating attention coefficients for every edge adds overhead compared to the simpler aggregation in GCNs or mean-aggregator GraphSAGE, especially in very dense graphs. However, the performance gains often justify this additional cost.