This hands-on lab focuses on direct performance measurement. A simple Python script will perform a common machine learning operation, matrix multiplication, and benchmark its execution time on both a CPU and a GPU. This exercise offers a tangible sense of the acceleration a GPU provides for parallelizable tasks.
The goal is not to conduct a rigorous scientific study, but to observe the orders-of-magnitude difference that drives hardware selection in AI.
At the heart of most neural networks lies a fundamental operation: matrix multiplication. A layer of a neural network can often be expressed as:
output = activation(weights · inputs + biases)

The operation weights · inputs is a large matrix multiplication. Calculating each element in the resulting matrix can be done independently of the others, making this a "pleasantly parallel" problem. This is precisely the kind of workload where a GPU's architecture, with its thousands of cores, is expected to outperform a CPU's few, more powerful cores.
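To make this concrete, here is a minimal sketch of one such layer written with PyTorch tensor operations. The dimensions and the ReLU activation are arbitrary choices for illustration, and the batched form inputs @ weights is used rather than the column-vector form above.

import torch

# Illustrative sizes only: a batch of 32 input vectors, 1024 inputs, 512 outputs
batch_size, in_features, out_features = 32, 1024, 512

inputs = torch.randn(batch_size, in_features)
weights = torch.randn(in_features, out_features)
biases = torch.randn(out_features)

# One layer: a matrix multiplication, a bias add, and an element-wise activation.
# Every entry of the resulting (32 x 512) product can be computed independently.
output = torch.relu(inputs @ weights + biases)
print(output.shape)  # torch.Size([32, 512])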
To follow along, you will need a Python environment with PyTorch installed. PyTorch is a popular deep learning framework that provides a simple interface for running computations on different hardware devices.
If you have an NVIDIA GPU, ensure you install the version of PyTorch compiled with CUDA support. You can find the correct command on the PyTorch official website. It will typically look something like this:
# Example command, check the official site for the latest version
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
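After installing, a quick check like the one below confirms which build you have; a CUDA wheel typically reports a version string with a +cu suffix, though the exact version on your machine will differ:

import torch

print(torch.__version__)          # e.g. "2.1.0+cu118" for a CUDA build (version will vary)
print(torch.cuda.is_available())  # True only if PyTorch can see a compatible NVIDIA GPU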
If you do not have a local GPU, you can run this entire exercise for free on Google Colab, which provides GPU-enabled environments. Simply open a new Colab notebook, and in the menu, select Runtime > Change runtime type and choose GPU as the hardware accelerator.
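Once the GPU runtime is active, you can confirm what Colab assigned from any notebook cell; the free tier often provides a T4, but the exact model varies:

import torch

if torch.cuda.is_available():
    # Reports the GPU model Colab (or your local machine) assigned, e.g. a Tesla T4
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; check the runtime type setting.")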
We will write a script that performs the same task, a series of large matrix multiplications, first on the CPU and then on the GPU. We use PyTorch for both to ensure a fair comparison, as the underlying implementation of the operation is the same; only the hardware executing it changes.
Create a file named benchmark.py and add the following code:
import torch
import time


def benchmark(device_name, matrix_size=4096, iterations=10):
    """
    Performs a matrix multiplication benchmark on a specified device.

    Args:
        device_name (str): The device to run on, 'cpu' or 'cuda'.
        matrix_size (int): The dimension of the square matrices.
        iterations (int): The number of times to repeat the multiplication.
    """
    print(f"--- Benchmarking on {device_name.upper()} ---")

    # Set the device for tensor allocation
    device = torch.device(device_name)

    # Create two large random matrices on the specified device.
    # Passing device=device allocates them directly on the target hardware.
    try:
        a = torch.randn(matrix_size, matrix_size, device=device)
        b = torch.randn(matrix_size, matrix_size, device=device)
    except torch.cuda.OutOfMemoryError:
        print("GPU out of memory. Try a smaller matrix_size.")
        return
    except Exception as e:
        print(f"An error occurred: {e}")
        return

    # Warm-up run to handle any initial setup costs
    _ = torch.matmul(a, b)

    # For GPUs, synchronize before starting the timer to ensure
    # all previous operations are complete.
    if device.type == 'cuda':
        torch.cuda.synchronize()

    start_time = time.time()

    for _ in range(iterations):
        c = torch.matmul(a, b)

    # For GPUs, synchronize again to ensure the multiplications
    # are finished before we stop the timer.
    if device.type == 'cuda':
        torch.cuda.synchronize()

    end_time = time.time()

    total_time = end_time - start_time
    avg_time_per_iter = (total_time / iterations) * 1000  # in milliseconds

    print(f"Matrix Size: {matrix_size}x{matrix_size}")
    print(f"Iterations: {iterations}")
    print(f"Total time: {total_time:.4f} seconds")
    print(f"Average time per multiplication: {avg_time_per_iter:.4f} ms\n")


if __name__ == "__main__":
    # Benchmark on CPU
    benchmark('cpu')

    # Check if a CUDA-enabled GPU is available
    if torch.cuda.is_available():
        # Benchmark on GPU
        benchmark('cuda')
    else:
        print("--- CUDA (GPU) not available on this system. ---")
        print("Skipping GPU benchmark.")
The torch.device(device_name) call creates a device object. When we create our tensors a and b, we pass device=device so they are allocated directly on the target hardware (CPU RAM or GPU VRAM).

torch.cuda.synchronize() is important for accurate GPU timing. CPU code and GPU code run asynchronously: when you call torch.matmul(a, b) on a GPU, the CPU queues the operation and may immediately move on to the next line of code (such as stopping the timer) before the GPU has actually finished the calculation. torch.cuda.synchronize() forces the CPU to wait until all previously queued GPU work is complete, giving an accurate wall-clock measurement. A small cross-check using CUDA events appears after the results below.

Open your terminal, navigate to the directory where you saved benchmark.py, and run it:
python benchmark.py
You will see output similar to this, though the exact numbers will vary greatly depending on your specific CPU and GPU models.
--- Benchmarking on CPU ---
Matrix Size: 4096x4096
Iterations: 10
Total time: 11.8921 seconds
Average time per multiplication: 1189.2100 ms
--- Benchmarking on GPU ---
Matrix Size: 4096x4096
Iterations: 10
Total time: 0.0754 seconds
Average time per multiplication: 7.5400 ms
The difference is stark. In this example run, the GPU performed the same task over 150 times faster than the CPU. This isn't because the CPU is "bad"; it is simply the wrong tool for this particular job. The CPU works through the multiplications with a handful of powerful cores largely in sequence, while the GPU's thousands of simpler cores each handle a small piece of the matrix multiplication simultaneously.
Chart: Execution time for 10 matrix multiplications of size 4096x4096. The GPU completes the task in a fraction of the time required by the CPU. Lower is better.
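As a cross-check of the synchronization point made earlier, PyTorch's CUDA event API can time the GPU work directly rather than relying on Python's wall clock. The sketch below recreates the same tensors as the lab script on a CUDA device; the number it reports should land close to the script's per-iteration average.

import torch

# Assumes a CUDA device is available; sizes mirror the lab script
device = torch.device('cuda')
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# CUDA events are recorded on the GPU's stream, so they bracket the
# device-side execution rather than the CPU-side kernel launches.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

_ = torch.matmul(a, b)      # warm-up, as in the benchmark script
torch.cuda.synchronize()

start.record()
for _ in range(10):
    _ = torch.matmul(a, b)
end.record()

torch.cuda.synchronize()    # wait until the recorded events have completed
print(f"Average per multiplication: {start.elapsed_time(end) / 10:.4f} ms")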
This practical result is the central reason why GPUs became the de facto standard for deep learning. The ability to accelerate these core parallel operations by orders of magnitude directly translates into training models in hours instead of weeks. As we move forward, keep this fundamental performance gap in mind. It influences every decision we make, from choosing a cloud instance to designing a multi-node training cluster.