Choosing the right compute hardware is an optimization problem that goes beyond raw performance. The architectural differences between CPUs, GPUs, and TPUs create distinct performance profiles and cost efficiencies for different machine learning tasks. An incorrect choice can lead to severe resource underutilization, training bottlenecks, and inflated cloud bills. Analyzing the architectural trade-offs helps you select the most appropriate processor for your workload.
CPUs are designed for versatility and low-latency execution of a wide range of tasks. Their architecture consists of a small number of powerful, complex cores, each capable of executing instructions independently and handling complex branching logic. Modern server-class CPUs employ Single Instruction, Multiple Data (SIMD) vector instructions, such as Intel's AVX (Advanced Vector Extensions), to perform the same operation on multiple data points simultaneously.
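As a rough illustration, the snippet below compares a pure-Python loop against NumPy's vectorized dot product, which dispatches to a BLAS kernel that exploits the CPU's SIMD units. This is a minimal sketch; exact timings depend on your CPU, BLAS build, and array sizes.

```python
import time
import numpy as np

# One million single-precision elements per vector.
n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# Pure-Python loop: one scalar multiply-add per iteration, no SIMD.
start = time.perf_counter()
total = 0.0
for x, y in zip(a, b):
    total += x * y
loop_time = time.perf_counter() - start

# Vectorized dot product: NumPy dispatches to an optimized BLAS kernel
# that applies SIMD instructions (e.g., AVX) to many elements at once.
start = time.perf_counter()
total_vec = float(np.dot(a, b))
vec_time = time.perf_counter() - start

print(f"python loop: {loop_time:.3f} s   vectorized: {vec_time:.5f} s")
```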
For machine learning, CPUs remain indispensable in several areas: data ingestion and preprocessing, feature engineering, classical algorithms such as linear models and gradient-boosted trees, low-latency inference for small models, and the orchestration logic that coordinates the rest of the pipeline.
However, for the large-scale matrix and tensor operations that define deep learning, the CPU's architecture becomes a limiting factor. The limited number of cores and lower memory bandwidth compared to specialized accelerators create a significant bottleneck, making them impractical for training large neural networks.
GPUs evolved from graphics rendering hardware into the de facto standard for deep learning training. Their architecture is fundamentally different from that of a CPU. A GPU contains thousands of smaller, simpler cores optimized for executing the same instruction across large arrays of data in parallel. This design is a perfect match for the tensor algebra at the heart of neural networks.
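To make this concrete, the short PyTorch sketch below runs the same large matrix multiplication on the CPU and, when one is available, on a CUDA GPU. It assumes PyTorch is installed; the explicit synchronize call matters because GPU kernels launch asynchronously.

```python
import torch

# A single large matrix multiplication, the core operation in dense layers.
n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

# CPU execution uses a handful of complex cores.
c_cpu = a @ b

# GPU execution (if available) spreads the same work across thousands of cores.
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    c_gpu = a_gpu @ b_gpu      # launched asynchronously on the device
    torch.cuda.synchronize()   # wait for the kernel to finish before timing or reading
```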
Two architectural features of modern data center GPUs are particularly significant for AI workloads:
Tensor Cores: These are specialized hardware units integrated into NVIDIA's GPU architectures since the Volta generation. Tensor Cores accelerate matrix multiply-accumulate (MMA) operations, the primary computations in deep learning, and are exceptionally efficient at mixed-precision computing. For instance, a Tensor Core can multiply two 16-bit floating-point matrices (FP16 or BF16) and add the result to a 32-bit floating-point (FP32) accumulator in a single operation. This dramatically increases throughput (measured in FLOP/s) and allows for training larger models or reducing training time.
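A common way to target Tensor Cores from a framework is automatic mixed precision. The sketch below uses PyTorch's autocast context, which runs matmul-heavy ops in BF16 (so they can be routed to Tensor Cores) while keeping accumulation and numerically sensitive ops in FP32. It assumes a CUDA GPU with Tensor Core support.

```python
import torch
from torch import nn

# A small layer and batch; in practice this would be a full training step.
model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Inside autocast, eligible ops (matmuls, convolutions) run in BF16,
# while reductions and other sensitive ops remain in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16: the low-precision matmul output
```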
High-Bandwidth Memory (HBM): To prevent the thousands of cores from starving for data, data center GPUs are equipped with HBM. This stacked DRAM technology provides memory bandwidth an order of magnitude higher than the DDR memory used with CPUs. For example, an NVIDIA A100 GPU offers over 1.5 TB/s of memory bandwidth, whereas a high-end CPU's memory system might provide around 200 GB/s. This high bandwidth is essential for feeding data to the compute units during the forward and backward passes of training.
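The effect of memory bandwidth is easy to estimate with back-of-the-envelope arithmetic. The numbers below (a 7-billion-parameter model in 16-bit precision, 1.5 TB/s versus 200 GB/s) are illustrative round figures, not benchmarks.

```python
# Back-of-the-envelope: time to stream a model's weights through memory once.
params = 7e9                 # e.g., a 7-billion-parameter model
bytes_per_param = 2          # FP16/BF16 weights
total_bytes = params * bytes_per_param

hbm_bandwidth = 1.5e12       # ~1.5 TB/s (data center GPU with HBM)
ddr_bandwidth = 200e9        # ~200 GB/s (high-end CPU memory system)

print(f"HBM: {total_bytes / hbm_bandwidth * 1000:.1f} ms per full pass over the weights")
print(f"DDR: {total_bytes / ddr_bandwidth * 1000:.1f} ms per full pass over the weights")
# Roughly 9 ms vs 70 ms: a lower bound on memory-bound steps, before any
# compute time is even counted.
```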
The combination of massive parallelism, specialized Tensor Cores, and high-bandwidth memory makes GPUs the most flexible and widely used accelerator for both training and high-throughput inference of deep learning models.
Tensor Processing Units (TPUs) are Google's custom-designed Application-Specific Integrated Circuits (ASICs) built for one purpose: accelerating neural network computations at scale. While a GPU is a programmable parallel processor, a TPU is a dedicated matrix processor.
The core of a TPU is its Matrix Multiply Unit (MXU), implemented as a systolic array: a grid of simple multiply-accumulate cells. Data is loaded into the processor and then "pumped" through the array in a rhythmic, wave-like fashion. Each cell performs a calculation and passes its operands and partial result to its neighbors. This design maximizes data reuse within the chip's on-chip memory, drastically reducing accesses to main memory (DRAM) and minimizing the power consumed per computation.
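To illustrate the idea, the toy NumPy simulation below models an output-stationary systolic array: each cell accumulates one output element while skewed rows of A and columns of B flow past it, one step per cycle. Real MXUs are fixed 128x128 hardware arrays; this sketch only captures the timing and data-movement pattern, not the hardware itself.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each processing element (i, j) holds one accumulator for C[i, j].
    Rows of A stream in from the left and columns of B stream in from the
    top, each skewed by one cycle per row/column so that matching operands
    meet in the right cell at the right time.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)

    # Enough cycles for the last skewed operand to reach the last cell.
    for t in range(n + m + k - 2):
        for i in range(n):            # all cells operate in parallel each cycle;
            for j in range(m):        # the loops just enumerate them.
                s = t - i - j         # which operand pair reaches cell (i, j) now
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```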
The primary advantages of TPUs stem from this specialized design: very high throughput on the large, dense matrix multiplications that dominate neural network training, strong performance per watt because data reuse inside the systolic array reduces costly memory traffic, and the ability to scale to "pods" of many interconnected chips over a dedicated high-speed interconnect.
This specialization also brings limitations. TPUs are less flexible than GPUs and perform best on workloads that can be expressed as large, dense matrix multiplications. Workloads that depend on custom CUDA kernels require GPUs outright, and models with significant branching or sparse data access patterns often perform better on GPUs as well. Furthermore, TPU availability is largely restricted to the Google Cloud Platform.
Choosing between CPU, GPU, and TPU involves analyzing your specific workload against the architectural strengths of each processor. The decision process often balances performance, cost, and development flexibility.
A decision flow for selecting compute hardware based on workload characteristics. The optimal choice depends on whether the task is training, inference, or data processing, as well as specific requirements for scale and performance.
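As a sketch of that decision flow, the hypothetical helper below maps a few coarse workload attributes to a hardware recommendation. The categories, thresholds, and function name are illustrative simplifications, not a definitive policy; real decisions also weigh cost, availability, and framework support.

```python
def recommend_hardware(task: str, model_size: str = "small",
                       scale: str = "single-node") -> str:
    """Illustrative heuristic mirroring the decision flow described above."""
    if task in {"data_preprocessing", "classical_ml", "orchestration"}:
        return "CPU"
    if task == "inference":
        return "GPU" if model_size == "large" else "CPU or small GPU"
    if task == "training":
        if scale == "pod" and model_size == "large":
            return "TPU pod or multi-node GPU cluster"
        return "GPU"
    return "CPU"  # default for general-purpose workloads

print(recommend_hardware("training", model_size="large", scale="pod"))
```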
Ultimately, the best infrastructure design often uses a combination of these processors. A typical MLOps pipeline might use CPUs for its data ingestion and preprocessing steps, a large cluster of GPUs or TPUs for distributed training, and a fleet of GPUs or even CPUs for serving the resulting model, with each component selected to maximize performance and cost-efficiency for its specific task.