Training large language models across multiple accelerators, whether GPUs within a single server or across numerous servers in a cluster, introduces a critical dependency: communication. The sheer volume of data that needs to be exchanged, such as gradients during data parallelism or activations and weights during model parallelism, can easily become a bottleneck, stalling the expensive compute units. Standard networking interfaces like typical Gigabit Ethernet are insufficient for these demands. High-performance interconnect technologies are therefore essential components of any LLM training infrastructure. These technologies focus on delivering two primary characteristics: high bandwidth (the rate at which data can be transferred) and low latency (the delay in initiating a data transfer).
Within a single server equipped with multiple GPUs, the primary communication pathway is typically the PCIe bus. While modern PCIe generations (like PCIe 4.0 or 5.0) offer substantial bandwidth, it's a shared resource, and communication between GPUs must often traverse the CPU's memory controller, adding latency. To overcome this, NVIDIA developed NVLink, a proprietary high-speed, point-to-point interconnect directly connecting GPUs.
NVLink allows GPUs within the same server to exchange data directly from their respective High Bandwidth Memory (HBM) without needing to route traffic through the CPU or the main system RAM via the PCIe bus. This results in significantly higher bandwidth and lower latency compared to using PCIe alone.
Comparison of GPU communication pathways. NVLink provides a direct, high-bandwidth link between GPUs, bypassing the slower, shared PCIe bus often used for GPU-CPU and indirect GPU-GPU communication.
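A quick way to confirm that GPUs in a node can reach each other directly is to query peer access from PyTorch. The sketch below uses torch.cuda.can_device_access_peer, which only reports whether direct GPU-to-GPU (peer) access is possible; it does not distinguish NVLink from PCIe peer-to-peer, so a tool such as nvidia-smi topo -m is still needed to see which link type actually connects each pair.

import torch

def check_peer_access():
    """Print which GPU pairs report direct (peer-to-peer) access."""
    num_gpus = torch.cuda.device_count()
    for src in range(num_gpus):
        for dst in range(num_gpus):
            if src == dst:
                continue
            ok = torch.cuda.can_device_access_peer(src, dst)
            status = "peer access available" if ok else "no peer access"
            print(f"GPU {src} -> GPU {dst}: {status}")

# check_peer_access()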
Each generation of NVLink has increased the available bandwidth. For example, NVLink 3.0 (used in A100 GPUs) offers up to 600 GB/s of bidirectional bandwidth per GPU, while NVLink 4.0 (used in H100 GPUs) increases this to 900 GB/s. Compare this to a typical PCIe 5.0 x16 slot, which provides 128 GB/s of bidirectional bandwidth. This dramatic difference is particularly impactful for model parallelism techniques (tensor and pipeline parallelism), where large intermediate activations or weight segments must be frequently exchanged between GPUs working on different parts of the same model layer or on different layers entirely. It also significantly accelerates collective communication operations like AllReduce, commonly used for synchronizing gradients in data parallelism, when performed across GPUs within a single node.
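To make these numbers concrete, a rough back-of-envelope estimate of per-step gradient AllReduce time is sketched below. The model size, GPU count, and per-direction bandwidth figures are illustrative assumptions (the bidirectional figures above are halved), and the formula ignores latency, protocol overhead, and overlap with compute, so treat the results as best-case orders of magnitude.

def allreduce_time_estimate(param_count, bytes_per_param, num_gpus, bw_gb_per_s):
    """Rough lower bound on ring AllReduce time.

    Each GPU sends (and receives) roughly 2 * (N - 1) / N times the payload
    size; dividing by per-direction link bandwidth gives a best-case time.
    """
    payload_bytes = param_count * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (bw_gb_per_s * 1e9)

# Illustrative assumptions: 7B parameters, fp16 gradients (2 bytes each),
# 8 GPUs in one node.
for name, bw in [("NVLink 3.0 (~300 GB/s per direction)", 300),
                 ("PCIe 5.0 x16 (~64 GB/s per direction)", 64)]:
    t = allreduce_time_estimate(7e9, 2, 8, bw)
    print(f"{name}: ~{t * 1000:.0f} ms per gradient AllReduce")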
When scaling LLM training beyond a single server, a high-performance network fabric connecting the nodes becomes necessary. This fabric needs to handle communication between potentially hundreds or thousands of GPUs distributed across many machines. The primary contenders in this space are InfiniBand and high-speed Ethernet, often utilizing RDMA.
InfiniBand is a high-performance computing network standard designed from the ground up for low latency and high bandwidth. It operates as a switched fabric, similar to Ethernet, but with several key differences optimized for HPC and AI workloads, most notably native support for Remote Direct Memory Access (RDMA) and lossless transport enforced by hardware-level, credit-based flow control.
InfiniBand has historically been the preferred networking choice for large-scale AI training clusters due to its consistent low latency and mature RDMA implementation.
Ethernet is ubiquitous, and advancements have pushed its speeds into the hundreds of gigabits per second (e.g., 200 GbE, 400 GbE). To compete with InfiniBand's low-latency capabilities for HPC/AI, the RoCE (RDMA over Converged Ethernet) protocol was developed. RoCE allows RDMA operations to run over standard Ethernet infrastructure.
The main advantage of RoCE is its potential to leverage existing Ethernet infrastructure and expertise, potentially offering a lower cost of entry. However, achieving the necessary lossless behavior across a large Ethernet fabric can be complex to configure and manage compared to InfiniBand, which has lossless operation built in. Performance-wise, modern high-speed Ethernet with well-configured RoCE can achieve latency and bandwidth close to that of contemporary InfiniBand.
The way servers are connected (the network topology) also impacts communication performance, especially for collective operations involving many nodes. Topologies like Fat-Tree or Dragonfly are commonly used in large clusters. They aim to provide high bisection bandwidth, meaning there is ample capacity for communication even when many nodes need to exchange data simultaneously across different parts of the network. An inadequate topology can lead to bottlenecks even if individual link speeds are high.
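For intuition, the bisection bandwidth of an idealized non-blocking fabric can be estimated with a one-line calculation. The node count and link speed below are illustrative assumptions, and oversubscribed designs deliver proportionally less.

def full_bisection_bandwidth_gbps(num_nodes, link_gbps):
    """Bisection bandwidth of an ideal non-blocking fabric (e.g., a full
    fat-tree): any half of the nodes can send to the other half at the
    full per-node link rate."""
    return (num_nodes // 2) * link_gbps

# Illustrative assumptions: 128 nodes, each with one 400 Gb/s network port.
bb = full_bisection_bandwidth_gbps(128, 400)
print(f"Ideal bisection bandwidth: {bb} Gb/s (~{bb / 8000:.1f} TB/s)")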
Deep learning frameworks and communication libraries abstract away many of the hardware details. Libraries like NVIDIA's NCCL (NVIDIA Collective Communications Library) are highly optimized for collective operations (e.g., AllReduce, Broadcast, AllGather) on NVIDIA GPUs.
import torch
import torch.distributed as dist
import os

def setup_distributed(backend='nccl'):
    """Initializes the distributed process group."""
    # Assumes environment variables MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE are set.
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    master_addr = os.environ['MASTER_ADDR']
    master_port = os.environ['MASTER_PORT']

    # Initialize the process group.
    # The NCCL backend will automatically try to use NVLink for intra-node
    # and InfiniBand/RoCE (if available and configured)
    # for inter-node communication.
    dist.init_process_group(
        backend=backend,
        init_method=f'tcp://{master_addr}:{master_port}',
        rank=rank,
        world_size=world_size
    )

    # Set the device for the current process
    torch.cuda.set_device(rank % torch.cuda.device_count())
    print(
        f"Rank {rank}/{world_size} initialized on device "
        f"{torch.cuda.current_device()}."
    )

# --- Example Usage ---
# if __name__ == "__main__":
#     setup_distributed()
#
#     # Example of a collective operation
#     tensor = (torch.ones(1, device=torch.cuda.current_device())
#               * dist.get_rank())
#     print(f"Rank {dist.get_rank()} before all_reduce: {tensor}")
#
#     # NCCL handles routing this over NVLink/InfiniBand/RoCE
#     dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
#
#     print(f"Rank {dist.get_rank()} after all_reduce: {tensor}")
#
#     dist.destroy_process_group()
PyTorch code initializing a distributed process group using the NCCL backend. NCCL intelligently selects the best available interconnect (NVLink, InfiniBand, RoCE) for communication operations like dist.all_reduce.
When torch.distributed is initialized with the nccl backend, NCCL probes the system hardware and automatically utilizes NVLink for fast communication between GPUs on the same node and InfiniBand or RoCE (if available and properly configured) for communication between GPUs on different nodes. Selecting and configuring the right interconnect technology is therefore a hardware and infrastructure task, but its benefits are realized through these high-level software libraries during training.
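One practical way to verify which transport NCCL actually selected is to enable its debug logging before initializing the process group. The snippet below is a small sketch using standard NCCL environment variables; the exact log output varies by NCCL version, and the commented-out variables are shown only as examples of knobs NCCL exposes.

import os

# Must be set before dist.init_process_group(). With NCCL_DEBUG=INFO, NCCL
# logs which transport it chose (P2P/NVLink, InfiniBand, or sockets) at startup.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Examples of other NCCL knobs (use with care, e.g. for troubleshooting):
# os.environ["NCCL_P2P_DISABLE"] = "1"   # disable NVLink/PCIe peer-to-peer
# os.environ["NCCL_IB_DISABLE"] = "1"    # disable InfiniBand/RoCE transport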
In summary, high-speed interconnects are non-negotiable components for efficient large-scale LLM training. NVLink provides the essential high-bandwidth, low-latency pathway between GPUs within a server, critical for model parallelism and fast intra-node collectives. InfiniBand and high-speed Ethernet (with RoCE) provide the network fabric needed to scale training across multiple servers, enabling data-parallel and pipeline-parallel schemes at scale. The choice between InfiniBand and Ethernet/RoCE involves trade-offs in performance consistency, cost, and configuration complexity, but both aim to minimize the communication overhead that can otherwise severely limit training throughput.