While CPUs offer great flexibility and GPUs provide massive parallelism for a range of tasks, there is another class of processor built for maximum efficiency on a single job: the Application-Specific Integrated Circuit, or ASIC. For certain large-scale AI workloads, these specialized chips offer performance and power efficiency that general-purpose hardware cannot match.
An Application-Specific Integrated Circuit is a chip designed and manufactured for one particular purpose. Unlike a CPU, which is engineered to run a full operating system and countless types of software, an ASIC has its logic gates physically laid out to perform a very narrow set of operations. Think of it as the difference between a programmable kitchen appliance that can mix, chop, and blend, versus a simple, highly efficient coffee grinder. The grinder does only one thing, but it does it faster and with less energy than the general-purpose appliance.
In the context of machine learning, ASICs are designed to accelerate the fundamental mathematical operations of neural networks, primarily matrix multiplication and convolution. By stripping away all unnecessary components, such as the complex branch prediction of a CPU or the graphics-rendering pipelines of a GPU, an AI ASIC can dedicate all of its silicon and power budget to these core calculations.
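To see why a chip built around matrix units can cover both of these operations, note that a convolution can itself be rewritten as a matrix multiplication. The sketch below (illustrative pure Python, with a hypothetical helper name) shows the idea for a 1-D convolution: sliding windows of the input become rows of a matrix, and one matrix product then computes every output position at once.

```python
# Sketch: a 1-D convolution expressed as a matrix multiply. The helper
# name `conv1d_as_matmul` is made up for illustration.
def conv1d_as_matmul(signal, kernel):
    k = len(kernel)
    # "im2col" step: each row is one sliding window of the input
    windows = [signal[i:i + k] for i in range(len(signal) - k + 1)]
    # one matrix-vector product yields all output positions at once
    return [sum(w * x for w, x in zip(kernel, win)) for win in windows]

print(conv1d_as_matmul([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```

This lowering is why hardware that accelerates only matrix multiplication can still serve convolutional networks.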
The most prominent example of an AI ASIC is Google's Tensor Processing Unit (TPU). First developed internally to accelerate inference for services like Google Search and Photos, TPUs are now available to the public through the Google Cloud Platform. They are designed from the ground up to execute the operations defined in machine learning frameworks like TensorFlow and JAX with extreme speed and efficiency.
The architectural innovation at the heart of the TPU is the systolic array.
A systolic array is a grid of simple, identical processing elements (PEs) that are connected to their nearest neighbors. Data flows through this grid in a rhythmic, wave-like pattern, similar to how blood is pumped through the circulatory system, which is where the name "systolic" comes from.
Here is how it works for a matrix multiplication, C = A·B:

1. The weights of matrix B are loaded into the grid from the side, one value per processing element, where they remain fixed in place.
2. The activations of matrix A stream into the grid from the top, staggered in time so that each value arrives at the correct PE on the correct clock cycle.
3. Each PE multiplies the incoming activation by its stored weight, adds the product to the partial sum arriving from its neighbor, and passes both the activation and the updated partial sum onward.
4. The completed entries of C emerge from the edge of the grid as the wave of data finishes passing through.
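The dataflow above can be sketched in plain Python. This is a minimal, non-cycle-accurate simulation: the weights of B stay fixed in a grid of PEs while each row of A streams through and partial sums accumulate down the columns.

```python
# Minimal sketch of a weight-stationary systolic array (illustrative,
# not cycle-accurate): weights stay fixed while activations stream
# through and partial sums flow down each column of PEs.
def systolic_matmul(A, B):
    """Compute C = A @ B by streaming rows of A through a grid of PEs,
    each holding one fixed weight B[i][j]."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    for row in range(m):                  # each activation row streams in
        psums = [0] * n                   # partial sums flowing down columns
        for i in range(k):                # PE row i holds weights B[i][:]
            a = A[row][i]                 # activation entering PE row i
            for j in range(n):            # each PE does one multiply-accumulate
                psums[j] += a * B[i][j]   # B[i][j] is reused for every row of A
        C[row] = psums                    # results emerge at the bottom edge
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Note that each weight is loaded once and reused for every row of A, which is exactly the memory-traffic saving described below.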
This design is incredibly efficient because data is constantly moving and being computed upon. The values of the weight matrix are reused many times without needing to be fetched from memory repeatedly, which dramatically reduces memory bandwidth bottlenecks and power consumption.
Diagram of a systolic array. Activations stream from the top, while weights are loaded from the side. Each Processing Element (PE) performs a calculation and passes data to its neighbors, resulting in a highly parallel and efficient computation flow.
The specialization of ASICs leads to a clear trade-off. For the tasks they were designed for, their performance and power efficiency can be an order of magnitude better than even high-end GPUs. This is often measured in Tera Operations Per Second (TOPS) and, more importantly, TOPS-per-watt.
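As a quick worked example of the metric, TOPS-per-watt is just throughput divided by power draw. The numbers below are hypothetical, chosen only to illustrate the comparison, not vendor benchmarks.

```python
# Hypothetical figures for illustration only -- not measured benchmarks.
def tops_per_watt(tops, watts):
    """Efficiency metric: operations per second delivered per watt."""
    return tops / watts

gpu_eff = tops_per_watt(300, 400)   # e.g. a general-purpose GPU: 0.75 TOPS/W
asic_eff = tops_per_watt(400, 200)  # e.g. a specialized ASIC: 2.0 TOPS/W
print(asic_eff / gpu_eff)           # the ASIC does more work per joule
```

At datacenter scale, this per-joule gap, more than raw speed, is often what justifies custom silicon.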
Normalized performance-per-watt for a typical matrix-heavy AI workload across different processor types. The specialized nature of the TPU allows it to perform its target operations with significantly less power.
However, this performance comes at the cost of flexibility. A TPU is not a general-purpose processor. It cannot run arbitrary Python code or render a user interface. It is optimized for a specific set of operations and data types (like bfloat16 or INT8). If your model uses a new, unsupported operation, you cannot run it on a TPU without modifying your model or waiting for hardware and software support to be added.
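The bfloat16 format mentioned above keeps float32's 8-bit exponent but truncates the mantissa to 7 bits, preserving range while sacrificing precision. The sketch below emulates that truncation by zeroing the low 16 bits of a float32 encoding; the helper name is made up for illustration.

```python
import struct

# Sketch: emulate bfloat16 rounding-toward-zero by masking off the low
# 16 bits of the IEEE 754 float32 bit pattern. (Real hardware typically
# uses round-to-nearest; this helper name is hypothetical.)
def to_bfloat16(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159265))  # 3.140625 -- precision lost, range kept
```

Neural networks tolerate this loss of precision well, which is what lets ASICs trade mantissa bits for smaller, faster multiplier circuits.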
This creates a spectrum of hardware choices:

- CPUs: maximally flexible, able to run any workload, but least efficient for the dense linear algebra at the heart of neural networks.
- GPUs: a middle ground, offering massive parallelism for a broad range of numeric workloads with good, but not peak, efficiency.
- ASICs: maximally efficient for their target operations, but unable to do anything else.
While Google's TPU is the most well-known AI ASIC, it is far from the only one. The demand for efficient AI computation has led to the development of other specialized chips:

- AWS Inferentia and Trainium: Amazon's custom accelerators for inference and training, respectively, offered through its cloud platform.
- Intel Gaudi: training and inference accelerators originally developed by Habana Labs.
- Cerebras Wafer-Scale Engine: an extreme design that turns an entire silicon wafer into a single enormous accelerator.
- Mobile NPUs, such as Apple's Neural Engine: small ASICs embedded in consumer devices for efficient on-device inference.
The existence of these diverse solutions indicates that for organizations operating at sufficient scale, designing or using custom silicon is a viable strategy for optimizing performance and cost. For most engineers and data scientists, the immediate choice will be between using these ASICs through cloud providers or sticking with more traditional GPU-based infrastructure. Your decision will depend on your workload's scale, your budget, and how much your model aligns with the ASIC's specialized capabilities.