When training a model on a single GPU, performance is limited by the processor's own computational capacity. When you scale training to multiple GPUs, a new bottleneck emerges: the communication link between them. During distributed training, GPUs must constantly exchange information, most notably the gradients calculated during backpropagation. If the interconnects between these processors are slow, the GPUs will spend more time waiting for data than performing computations, negating the benefits of having multiple accelerators. Standard motherboard interconnects like PCIe (Peripheral Component Interconnect Express), while fast for general-purpose peripherals, are often insufficient for the demands of large-scale model training.
This is where specialized high-bandwidth interconnects become essential. They are the high-speed data highways of modern AI supercomputers, designed specifically to minimize communication latency and maximize bandwidth between processors. We will examine three important technologies in this domain: NVLink, NVSwitch, and InfiniBand.
For communication within a single server or node, NVIDIA has developed a proprietary interconnect technology that provides a direct, high-speed link between GPUs.
NVLink is a point-to-point GPU interconnect that offers significantly higher bandwidth than a standard PCIe connection. For instance, a single third-generation NVLink link provides 50 GB/s of bidirectional bandwidth, whereas a PCIe 4.0 x16 slot provides roughly 32 GB/s; an A100 GPU bundles twelve such links for 600 GB/s of aggregate GPU-to-GPU bandwidth. More recent generations push this even further. By connecting GPUs directly, NVLink allows for faster model and data sharing, which is particularly useful for model parallelism, where different parts of a large model reside on different GPUs.
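To get a feel for what these bandwidth figures mean in practice, the following rough calculation estimates how long a single full gradient exchange would take over each interconnect. The 7B-parameter model size and the 80% link-efficiency factor are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope estimate of one gradient transfer over different links.
# Model size and efficiency factor are illustrative assumptions.

GRAD_BYTES = 7e9 * 2          # e.g. a 7B-parameter model with fp16 gradients
EFFICIENCY = 0.8              # assume ~80% of peak bandwidth is achievable

links = {
    "PCIe 4.0 x16": 32e9,                 # bytes/s, peak
    "NVLink 3 (single link)": 50e9,
    "NVLink 3 (A100, 12 links)": 600e9,
}

for name, peak in links.items():
    seconds = GRAD_BYTES / (peak * EFFICIENCY)
    print(f"{name:>28}: {seconds * 1e3:8.1f} ms per full gradient transfer")
```

Even with these simplifying assumptions, the gap between a PCIe path and an aggregated NVLink path is large enough to change how often gradient synchronization dominates a training step.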
However, a simple point-to-point link has limitations. In a server with eight GPUs, you cannot create a direct NVLink connection from every GPU to every other GPU. This would require an infeasible number of ports and complex wiring.
NVSwitch solves the "all-to-all" communication problem within a node. It acts as a non-blocking crossbar switch for NVLink, enabling any GPU to communicate with any other GPU simultaneously at full NVLink speed. Think of it as a network switch, but one designed for the extreme speeds and low latencies required by tightly coupled GPUs. High-end AI servers, such as NVIDIA's DGX systems, use NVSwitch to create a unified memory space across all GPUs in the node. This architecture is what makes training truly massive models on a single machine possible.
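From a training script's point of view, none of this requires special code: communication libraries such as NCCL detect and use NVLink/NVSwitch automatically. The sketch below shows a minimal intra-node all-reduce with PyTorch's `torch.distributed` and the NCCL backend; the script name in the launch command is a placeholder.

```python
# Minimal sketch of an intra-node all-reduce with PyTorch + NCCL.
# When NVLink/NVSwitch is present, NCCL routes this traffic over it
# automatically; no code changes are needed. Launch with, for example:
#   torchrun --nproc_per_node=8 allreduce_sketch.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # torchrun supplies rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for a gradient tensor produced during backpropagation.
grad = torch.full((1024, 1024), float(local_rank), device="cuda")

dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sums the tensor across all GPUs in the node
print(f"rank {dist.get_rank()}: first element = {grad[0, 0].item()}")

dist.destroy_process_group()
```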
The diagram below illustrates the architectural difference. On the left, GPUs in a traditional server communicate over PCIe, often forcing data to travel through the CPU, which creates contention. On the right, NVSwitch provides a direct, high-bandwidth path between all GPUs.
A comparison of intra-node GPU communication architectures. The NVSwitch model provides a direct, non-blocking fabric for all GPUs, eliminating the bottlenecks present in a hierarchical PCIe-based system.
While NVLink and NVSwitch excel at intra-node communication, they do not connect separate machines. To scale a training job from one 8-GPU server to hundreds of them, you need a high-performance network fabric that connects the nodes. This is the domain of InfiniBand.
InfiniBand is a networking standard widely used in high-performance computing (HPC) that offers higher throughput and much lower latency than traditional Ethernet. Modern InfiniBand generations such as NDR, deployed in NVIDIA's Quantum-2 platform, provide up to 400 Gb/s of bandwidth per link.
The defining feature of InfiniBand for AI workloads is its support for Remote Direct Memory Access (RDMA). RDMA allows the network interface card (NIC) of one server to directly read from or write to the memory of another server, without involving either server's operating system or CPU. During an all-reduce operation in distributed training, this means a GPU's data can be sent directly to the memory of a remote GPU with minimal overhead. Bypassing the CPU and kernel dramatically reduces communication latency, which is a significant factor when performing frequent, small updates across many nodes.
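In multi-node PyTorch jobs, the InfiniBand/RDMA path is again handled by NCCL rather than by your training code, but it can be steered and verified through NCCL's environment variables. The sketch below sets them from Python before process-group initialization; the HCA and interface names are placeholders that depend on your cluster.

```python
# Sketch: steering NCCL onto the InfiniBand/RDMA transport for multi-node training.
# These variables are read by NCCL when communicators are created, so they must be
# set before init_process_group (or, more commonly, in the job launcher/environment).
import os

import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")          # log which transport NCCL selects
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # allow the InfiniBand/RDMA transport
os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # match HCAs named mlx5_* (placeholder)
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic (placeholder)

dist.init_process_group(backend="nccl")
# With NCCL_DEBUG=INFO, the startup log typically mentions NET/IB when the
# RDMA transport is in use, or NET/Socket when NCCL has fallen back to TCP.
```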
In practice, these technologies are combined to form a hierarchical communication architecture optimized for performance at different scales.
This layered approach ensures that the most frequent and performance-sensitive communication, like that between neighboring GPUs in a model-parallel setup, happens over the fastest links.
A hierarchical communication architecture for a multi-node AI cluster. NVLink/NVSwitch handles fast communication within each node, while InfiniBand connects the nodes for distributed training. Standard Ethernet is used for less latency-sensitive tasks like accessing storage.
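Frameworks let you map parallelism strategies onto this hierarchy explicitly. The sketch below uses PyTorch's `DeviceMesh` (available in recent releases) to place tensor parallelism on the intra-node NVLink/NVSwitch tier and data parallelism on the inter-node InfiniBand tier; the 16-node by 8-GPU cluster shape is an assumed example.

```python
# Sketch of mapping parallelism onto the hardware hierarchy with a PyTorch DeviceMesh.
# Assumes a cluster of 16 nodes x 8 GPUs (128 ranks), launched with torchrun.
from torch.distributed.device_mesh import init_device_mesh

# Outer dim: data parallelism across nodes (InfiniBand traffic).
# Inner dim: tensor/model parallelism within a node (NVLink/NVSwitch traffic).
mesh = init_device_mesh(
    "cuda",
    (16, 8),
    mesh_dim_names=("data_parallel", "tensor_parallel"),
)

# Each sub-mesh yields a process group whose collectives stay on the
# corresponding interconnect tier.
tp_group = mesh["tensor_parallel"].get_group()
dp_group = mesh["data_parallel"].get_group()
```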
Choosing the right interconnects is a matter of matching infrastructure to workload. For experimenting with small models, standard networking may suffice. But for production-scale training of foundation models, a hierarchical architecture using NVLink, NVSwitch, and InfiniBand is not a luxury; it is a requirement for completing training jobs in a reasonable timeframe. Cloud providers have recognized this, and their most advanced AI-focused instances now provide these high-performance interconnects as a standard feature.