Some GNN architectures, such as GCN and GraphSAGE, treat all neighbors of a node with equal or structurally-determined importance. A GCN, for example, averages the features of neighboring nodes using weights derived from node degrees. This is computationally efficient, but it's a significant limitation. In many graphs, some neighbors are far more important than others. Think of a citation network where a foundational paper cited by a new article should carry more weight than a tangential reference.
Graph Attention Networks (GATs) address this by allowing each node to learn the relative importance of its neighbors. Instead of using fixed weights, GATs compute attention coefficients for each edge, effectively learning how much "attention" a node should pay to each of its neighbors during the aggregation step. This mechanism is inspired by the successful attention models used in natural language processing, particularly the Transformer.
The core of the GAT layer is the self-attention mechanism, which computes a score for every edge in the graph. This score, an attention coefficient $e_{ij}$, quantifies the importance of node $j$'s features to node $i$.
The process begins with the input features for two connected nodes, $\mathbf{h}_i$ and $\mathbf{h}_j$, where $\mathbf{h}_i \in \mathbb{R}^F$; $N$ is the number of nodes and $F$ is the number of features per node.
Linear Transformation: First, a shared learnable linear transformation, parameterized by a weight matrix $\mathbf{W} \in \mathbb{R}^{F' \times F}$, is applied to every node's feature vector. This projects the features into a higher-level representation of dimension $F'$.
Scoring Mechanism: The GAT paper proposes a simple single-layer feedforward network to compute the attention coefficient. The transformed feature vectors of the two nodes, $\mathbf{W}\mathbf{h}_i$ and $\mathbf{W}\mathbf{h}_j$, are concatenated. This combined vector is then multiplied by a learnable weight vector $\mathbf{a} \in \mathbb{R}^{2F'}$, followed by a LeakyReLU non-linearity:

$$e_{ij} = \text{LeakyReLU}\left(\mathbf{a}^\top \left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j\right]\right)$$
This calculation produces a single scalar score $e_{ij}$ for the edge from $j$ to $i$. This score is un-normalized and represents the raw importance of their connection. This process is performed for every neighbor $j$ of node $i$.
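The scoring step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the feature sizes, random weights, and the 0.2 LeakyReLU slope (the value used in the GAT paper) are all chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

F, F_out = 4, 8                    # input / output feature sizes (illustrative)
W = rng.normal(size=(F_out, F))    # shared linear transformation W
a = rng.normal(size=2 * F_out)     # learnable attention vector a

h_i = rng.normal(size=F)           # features of node i
h_j = rng.normal(size=F)           # features of a neighbor j

def leaky_relu(x, slope=0.2):      # GAT uses a negative slope of 0.2
    return np.where(x > 0, x, slope * x)

# e_ij = LeakyReLU(a^T [W h_i || W h_j])
e_ij = leaky_relu(a @ np.concatenate([W @ h_i, W @ h_j]))
print(e_ij)                        # a single un-normalized scalar score
```

Note that $\mathbf{W}$ and $\mathbf{a}$ are shared across all edges, so the number of parameters does not grow with graph size.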
These raw attention scores are not directly usable because they are difficult to compare across different nodes. A node with many neighbors might have scores with a different scale than a node with few neighbors. To make them comparable and turn them into a probability distribution, we normalize them using the softmax function.
A significant detail here is masked attention. The softmax is only applied over the set of neighbors of node $i$, denoted by $\mathcal{N}_i$. This injects the graph structure into the mechanism, ensuring that a node only computes attention scores with its immediate neighbors.
The final attention weight $\alpha_{ij}$ is calculated as:

$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
The resulting weights $\alpha_{ij}$ are positive and sum to 1 over all neighbors of node $i$. They represent a distribution of attention that node $i$ places on its neighbors.
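Masked softmax is usually implemented by setting non-neighbor scores to $-\infty$ before exponentiating, so they receive exactly zero weight. A small sketch with made-up scores for a 5-node graph:

```python
import numpy as np

# Raw scores e_ij for node i against every node in a 5-node graph,
# with a boolean mask marking the neighbors N_i. Values are illustrative.
e_i = np.array([0.9, -1.2, 0.3, 2.0, -0.5])
neighbor_mask = np.array([True, False, True, True, False])

def masked_softmax(scores, mask):
    # Non-neighbors get -inf, so exp(-inf) = 0 removes them from the sum.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores[mask].max()   # stabilize the exponentials
    exp = np.exp(scores)
    return exp / exp.sum()

alpha_i = masked_softmax(e_i, neighbor_mask)
print(alpha_i)            # zeros outside N_i
print(alpha_i.sum())      # 1.0
```

Subtracting the maximum score before exponentiating is the standard numerical-stability trick; it does not change the result because softmax is shift-invariant.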
Figure: The flow of the GAT attention mechanism for a single node $i$ and its neighbors. The process computes, normalizes, and applies attention weights to create an updated node representation.
With the normalized attention weights in hand, the final step is to compute the new feature vector for node $i$, denoted as $\mathbf{h}_i'$. This is done by taking a weighted sum of the linearly transformed features of all its neighbors:

$$\mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W}\mathbf{h}_j\right)$$
Here, $\sigma$ represents a final non-linearity, such as ReLU or ELU, applied to the aggregated features. This entire process constitutes a single GAT layer.
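Putting the three steps together, a full forward pass of a single-head GAT layer can be written vectorized in NumPy. This is a minimal sketch on a hand-made 5-node graph with random weights; it also uses a common implementation trick, splitting $\mathbf{a}$ into a source half and a target half, which is algebraically identical to concatenating $[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j]$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F, F_out = 5, 4, 8                       # nodes, in/out feature sizes
H = rng.normal(size=(N, F))                 # node feature matrix
A = np.array([[1, 1, 0, 1, 0],              # adjacency with self-loops
              [1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [1, 0, 1, 1, 1],
              [0, 0, 0, 1, 1]], dtype=bool)

W = rng.normal(size=(F, F_out)) * 0.1       # shared linear transformation
a = rng.normal(size=2 * F_out)              # attention vector

Z = H @ W                                   # W h_j for every node, shape (N, F_out)

# e[i, j] = LeakyReLU(a^T [z_i || z_j]) decomposes into a source and a target part.
src = Z @ a[:F_out]                         # contribution of z_i, shape (N,)
dst = Z @ a[F_out:]                         # contribution of z_j, shape (N,)
e = src[:, None] + dst[None, :]             # all pairwise raw scores, shape (N, N)
e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU, slope 0.2

e = np.where(A, e, -np.inf)                 # masked attention: neighbors only
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over each node's neighbors

H_new = np.maximum(alpha @ Z, 0.0)          # sigma = ReLU on the weighted sum
print(H_new.shape)                          # (5, 8)
```

Each row of `alpha` is the attention distribution of one node over its neighbors, and `alpha @ Z` performs all the weighted sums at once.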
The learning process of self-attention can sometimes be unstable. To mitigate this and to allow the model to capture different types of relationships, GATs employ multi-head attention. This involves running several independent attention mechanisms, or "heads," in parallel.
Each head $k$ has its own set of parameters ($\mathbf{W}^k$ and $\mathbf{a}^k$) and computes its own set of attention weights $\alpha_{ij}^k$. Each head produces a separate embedding. These embeddings are then combined to form the final output. There are two common ways to combine them:
Concatenation (for intermediate layers): The outputs from all $K$ heads are concatenated, creating a larger feature vector:

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K} \, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j\right)$$

The resulting output has a dimension of $K \cdot F'$, where $K$ is the number of heads and $F'$ is the output dimension of a single head.
Averaging (for the final layer): For the final output layer of the network (e.g., for classification), concatenation is not sensible. Instead, the embeddings from the different heads are averaged:

$$\mathbf{h}_i' = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j\right)$$
By using multiple heads, the model can jointly attend to information from different representation subspaces at different positions. For example, one head might focus on community structure while another focuses on local connectivity patterns, making the overall model more expressive and stable.
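The two combination strategies reduce to a concatenation versus a mean over the per-head outputs. A small sketch, assuming each head has already produced its embedding matrix (the head outputs here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, F_out = 3, 5, 8   # heads, nodes, per-head output size

# Placeholder: each head's (N, F_out) embedding, as produced by a GAT layer.
head_outputs = [rng.normal(size=(N, F_out)) for _ in range(K)]

# Intermediate layers: concatenate along the feature axis -> (N, K * F_out)
h_concat = np.concatenate(head_outputs, axis=1)

# Final layer: average the heads instead -> (N, F_out)
h_avg = np.mean(head_outputs, axis=0)

print(h_concat.shape)   # (5, 24)
print(h_avg.shape)      # (5, 8)
```

Concatenation preserves each head's subspace for the next layer to mix, while averaging keeps the output dimension fixed for the task head.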