The performance of collective communication operations in a distributed system, such as the AllReduce used in data-parallel training, is directly limited by the network's ability to move data between compute nodes. While intra-node communication is handled efficiently by high-speed interconnects like NVLink, the inter-node network is where performance bottlenecks frequently appear. An improperly designed network topology can easily negate the benefits of having the most powerful accelerators, leading to idle GPUs waiting on data and dramatically increasing training times and costs. The primary objective of network design for an ML cluster is to provide predictable, high-bandwidth, and low-latency communication paths between all nodes.
Traditional enterprise network designs, often based on a simple tree structure, are built to optimize for north-south traffic, which flows between end-users and central servers or the internet. They are not designed for the intense east-west traffic patterns characteristic of high-performance computing and distributed ML. In these workloads, every node may need to communicate with every other node simultaneously.
This mismatch leads to a condition known as oversubscription. Oversubscription occurs when the bandwidth available at a higher level of the network hierarchy is less than the aggregate bandwidth of the levels below it. For example, if you have 48 servers each with a 100 GbE link connected to a single switch, that switch would need a 4.8 Tbps uplink to the next layer to avoid being a bottleneck. In practice, this uplink is often much smaller, creating contention. When multiple nodes attempt to send data through this constrained uplink, packets are dropped, latency increases, and overall throughput plummets.
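To make the arithmetic concrete, here is a minimal Python sketch of the example above. The 800 Gbps uplink figure used for the oversubscribed case is an illustrative assumption, not a number from the text:

```python
# Aggregate bandwidth behind the leaf switch from the example above.
servers = 48
link_gbps = 100                        # each server has a 100 GbE NIC
required_uplink_gbps = servers * link_gbps
print(required_uplink_gbps)            # 4800 Gbps, i.e. the 4.8 Tbps from the text

# With a smaller, more typical uplink the switch is oversubscribed:
uplink_gbps = 800                      # assumed figure for illustration
print(f"{required_uplink_gbps / uplink_gbps:g}:1")   # 6:1 oversubscription
```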
To address the limitations of traditional designs, modern data centers and ML clusters employ a spine-leaf architecture, a type of Clos network. This topology consists of two layers of switches:

- Leaf switches form the access layer: each compute node connects directly to a leaf switch, typically the one in its own rack.
- Spine switches form the backbone: every leaf switch connects to every spine switch, while leaves never connect to other leaves and spines never connect to other spines.
This design ensures that traffic between any two nodes in the cluster travels a predictable, fixed path: from the source node to its leaf switch, up to a spine switch, and down to the destination node's leaf switch. Any two servers on different leaves are therefore separated by at most two inter-switch hops (leaf to spine, spine to leaf), which keeps latency both low and uniform across the cluster.
A two-tier spine-leaf network. Every leaf connects to every spine, creating multiple equal-cost, high-bandwidth paths and ensuring a maximum of two inter-switch hops between any two compute nodes.
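To illustrate why every pair of servers sees the same short path, here is a small Python sketch of a hypothetical four-leaf, two-spine fabric. The names and the `switch_paths` helper are invented for illustration; in real fabrics, equal-cost multipath (ECMP) routing typically load-balances flows across these parallel paths:

```python
# A hypothetical four-leaf, two-spine fabric, modeled by name only:
# every leaf connects to every spine.
leaves = [f"leaf{i}" for i in range(4)]
spines = [f"spine{j}" for j in range(2)]

def switch_paths(src_leaf: str, dst_leaf: str) -> list[tuple[str, str, str]]:
    """All switch-level paths between servers on two different leaves.

    Every path has the same shape, leaf -> spine -> leaf (two
    inter-switch hops), and there is exactly one path per spine.
    """
    return [(src_leaf, spine, dst_leaf) for spine in spines]

print(switch_paths(leaves[0], leaves[3]))
# [('leaf0', 'spine0', 'leaf3'), ('leaf0', 'spine1', 'leaf3')]
```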
The scalability of a spine-leaf fabric is another significant attribute. You can increase the east-west bandwidth by adding more spine switches or accommodate more servers by adding new leaf switches, all without redesigning the core network.
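As a hedged sketch of that scaling math, the function below estimates the maximum size of a non-blocking two-tier fabric from switch port counts alone. It assumes uniform link speeds and that each leaf devotes half its ports to servers and half to spine uplinks; real designs vary:

```python
def max_fabric(leaf_ports: int, spine_ports: int) -> dict:
    """Upper bounds for a non-blocking two-tier leaf-spine fabric.

    Assumes uniform link speeds and that each leaf splits its ports
    evenly: half face servers (downlinks), half face spines (one
    uplink per spine). Each spine then spends one port per leaf, so
    the spine radix caps the number of leaves.
    """
    servers_per_leaf = leaf_ports // 2
    num_spines = leaf_ports // 2      # one uplink from each leaf to each spine
    num_leaves = spine_ports          # one spine port consumed per leaf
    return {
        "spines": num_spines,
        "leaves": num_leaves,
        "servers": num_leaves * servers_per_leaf,
    }

# 64-port leaves and 64-port spines: 32 spines, 64 leaves, 2,048 servers.
print(max_fabric(64, 64))
```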
For the most demanding distributed training jobs, the goal is to build a non-blocking network fabric. This is a network where sufficient capacity exists for all nodes to communicate with each other simultaneously at their full link speed. The fat-tree topology is a specific implementation of a spine-leaf network designed to achieve this.
The "fat" designation refers to the links getting thicker, in aggregate bandwidth, as you move up the tree from the leaves toward the spine. In practice this is achieved with more parallel links or faster ports rather than literally thicker cables. The result is that the bandwidth at any layer of switches matches the total bandwidth of the devices connected below it, so no tier becomes a choke point.
The performance of a fat-tree is often described by its subscription ratio: the ratio of downstream bandwidth (connections to servers) to upstream bandwidth (connections to the next layer of switches). A 1:1 ratio means the fabric is non-blocking; a 4:1 ratio means four units of server-facing bandwidth contend for every unit of uplink capacity.
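A short sketch of the ratio calculation at the level of a single leaf switch; the port counts and speeds below are illustrative assumptions:

```python
def subscription_ratio(downlinks: int, downlink_gbps: float,
                       uplinks: int, uplink_gbps: float) -> str:
    """Express a leaf switch's subscription ratio in n:1 form."""
    downstream = downlinks * downlink_gbps
    upstream = uplinks * uplink_gbps
    return f"{downstream / upstream:g}:1"

# 32 x 100 GbE server ports with 8 x 400 GbE uplinks: fully non-blocking.
print(subscription_ratio(32, 100, 8, 400))   # "1:1"
# The same leaf with half the uplinks becomes 2:1 oversubscribed.
print(subscription_ratio(32, 100, 4, 400))   # "2:1"
```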
For large-scale ML, especially for training foundation models, aiming for a 1:1 subscription ratio in your network fabric is a standard best practice.
While topology and subscription ratio are design attributes, the resulting performance is best measured by bisection bandwidth. This is the total available bandwidth between two equal halves of the network. Imagine drawing a line that cuts the cluster's network in half; the bisection bandwidth is the sum of the speeds of all the links that cross this line.
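A minimal sketch of that calculation for an idealized non-blocking fabric, where every node in one half can stream at line rate to the other half; the 1,024-node, 400 GbE configuration is an illustrative assumption:

```python
def bisection_bandwidth_gbps(servers: int, link_gbps: float) -> float:
    """Bisection bandwidth of an ideal non-blocking (1:1) fabric.

    Split the cluster into two equal halves. In a non-blocking
    fat-tree, every node in one half can stream at full line rate
    to a node in the other half, so the links crossing the cut must
    carry (servers / 2) * link speed in aggregate.
    """
    return (servers / 2) * link_gbps

# A hypothetical 1,024-node cluster with 400 GbE per node:
print(bisection_bandwidth_gbps(1024, 400))   # 204800.0 Gbps, ~204.8 Tbps
```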
This metric is so important because it directly reflects the network's capacity to handle worst-case communication patterns, such as the AllReduce operation where every node must exchange gradients with every other node. A higher bisection bandwidth translates directly to faster completion of these collective operations, reducing the overall time-to-train. A well-designed fat-tree network is architected specifically to maximize bisection bandwidth.
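To connect bandwidth to time-to-train, here is a first-order estimate for a bandwidth-optimal ring AllReduce, one common implementation of the operation. It ignores latency and overlap with compute, and the node count, payload size, and link speed are illustrative assumptions:

```python
def ring_allreduce_seconds(nodes: int, payload_gb: float,
                           per_node_gbps: float) -> float:
    """First-order time for a bandwidth-bound ring AllReduce.

    Each node sends and receives 2 * (nodes - 1) / nodes times the
    payload, so on a non-blocking fabric the wall time is roughly
    that volume divided by per-node link bandwidth. Latency terms
    and compute overlap are ignored.
    """
    traffic_gbits = 2 * (nodes - 1) / nodes * payload_gb * 8
    return traffic_gbits / per_node_gbps

# 10 GB of gradients across 128 nodes at 400 Gbps each: ~0.4 s per step.
print(ring_allreduce_seconds(128, 10, 400))
```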
When evaluating cloud provider offerings or designing an on-premises cluster, bisection bandwidth is one of the most important specifications to examine. For instance, platforms like Google's TPU Pods and AWS's UltraClusters explicitly advertise their non-blocking, high-bisection-bandwidth networks as a headline feature for large-scale training. These systems use high-speed Ethernet (e.g., 400 GbE) or InfiniBand in a carefully constructed fat-tree topology to ensure that network communication does not impede the computational power of the accelerators.