To optimize your system, you must first understand its behavior. Simply running a training job and waiting for it to finish provides no insight into why it took as long as it did. Profiling is the process of measuring the resource consumption and execution time of different parts of your application. It allows you to move from guessing about performance to making data-driven decisions. An underutilized GPU is a common problem, often caused by a CPU-bound data pipeline or inefficient operations. By using the right tools, you can pinpoint exactly where your system is spending its time and identify the primary bottlenecks to address.
Before exploring complex profiling libraries, a simple, real-time check of your system's resources can provide immediate clues. These command-line tools are the first step in any performance investigation.
htop
The htop utility is an interactive process viewer for Unix-like systems and a significant improvement over the traditional top command. It gives you a color-coded, real-time view of your CPU cores, memory usage, and a list of running processes sorted by resource consumption.
When you run your training script, open another terminal and run htop. Watch the CPU meters at the top. If you see one or more CPU cores consistently at 100% while your GPU sits idle, you likely have a CPU bottleneck. This often occurs during data preprocessing, where tasks like image augmentation or text tokenization are running on the CPU before the data is sent to the GPU.
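A quick way to confirm this suspicion from inside the training script is to time how long each iteration waits for data versus how long it spends computing. The following is a minimal sketch, assuming a PyTorch-style loop with hypothetical train_loader, model, criterion, and optimizer objects; if the data wait dominates, the input pipeline is the bottleneck.
import time
import torch

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()

for inputs, targets in train_loader:          # hypothetical DataLoader
    data_time += time.perf_counter() - end    # time spent waiting for the CPU to produce the batch

    start = time.perf_counter()
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                  # wait for queued GPU work so the timing is meaningful
    compute_time += time.perf_counter() - start

    end = time.perf_counter()

print(f"waiting for data: {data_time:.1f}s, compute: {compute_time:.1f}s")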
nvidia-smi
For any system with NVIDIA GPUs, the NVIDIA System Management Interface (nvidia-smi) is an indispensable tool. It provides a detailed summary of the status of all available GPUs. To get a continuous, updating view, run it with the watch command.
watch -n 1 nvidia-smi
This command refreshes the output every second. A typical output looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    59W / 400W |   1540MiB / 40960MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
The fields worth watching are GPU-Util, the percentage of time over the last sample period during which the GPU was executing kernels; Memory-Usage, the amount of device memory allocated out of the total available; and Pwr:Usage/Cap and Temp, which indicate how hard the card is working and whether it may be thermally throttling. If nvidia-smi shows low GPU-Util while your training job is running, your investigation should immediately turn to the data pipeline and CPU-bound operations.
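For longer runs it can be convenient to log utilization instead of watching it live. The sketch below is one way to do this, assuming nvidia-smi is on the PATH; it polls the query interface once per second and appends the samples to a CSV file (the file name, interval, and sample count are arbitrary choices).
import subprocess
import time

def log_gpu_utilization(out_path="gpu_log.csv", interval_s=1.0, samples=60):
    """Append GPU utilization and memory usage samples to a CSV file."""
    query = [
        "nvidia-smi",
        "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader",
    ]
    with open(out_path, "a") as f:
        for _ in range(samples):
            # Each call returns one CSV line per GPU, e.g. "... , 95 %, 1540 MiB, 40960 MiB"
            f.write(subprocess.check_output(query, text=True))
            time.sleep(interval_s)

log_gpu_utilization()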
While nvidia-smi tells you if your GPU is busy, it doesn't tell you what it's doing. To understand the performance of individual operations within your model, you need to use a profiler built into your machine learning framework. These tools can break down the execution time of every function, kernel launch, and memory copy.
PyTorch includes a built-in profiler, torch.profiler, that is easy to integrate into your training script. You simply wrap the code you want to analyze, typically the training loop, in a context manager. The profiler can track both CPU and GPU activities.
Here is a basic example of how to use it:
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
model = models.resnet18().cuda()
inputs = torch.randn(16, 3, 224, 224).cuda()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print a summary to the console
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# For a detailed timeline view, export a Chrome trace file
# prof.export_chrome_trace("trace.json")
The output of prof.key_averages().table() gives you a list of operations, showing the time spent on the CPU and GPU for each. This is extremely useful for finding which specific layers or functions are taking the most time. For an even more powerful view, you can export the results as a Chrome trace file or view them in TensorBoard. The trace view shows a timeline of operations, making it easy to spot gaps where the GPU is idle.
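To send the results to TensorBoard instead of a raw trace file, the profiler can write its output through torch.profiler.tensorboard_trace_handler, usually combined with a schedule so that only a few steps are recorded (viewing the timeline in TensorBoard requires the torch-tb-profiler plugin). A minimal sketch, assuming a hypothetical train_loader and train_step function:
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3 active steps
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
) as prof:
    for step, batch in enumerate(train_loader):   # hypothetical DataLoader
        train_step(batch)                          # hypothetical forward/backward/optimizer step
        prof.step()                                # tell the profiler a training step has finished
        if step >= 5:
            break

# Afterwards: tensorboard --logdir=./log/profiler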
TensorFlow's profiler is tightly integrated with TensorBoard. The easiest way to capture a profile is by using the TensorBoard callback during model training with tf.keras. You can configure it to profile specific batches.
import tensorflow as tf
# ... model and data setup ...
# Create a TensorBoard callback
log_dir = "logs/profile/"
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    profile_batch='10,20'  # Profile batches 10 through 20
)

model.fit(
    train_dataset,
    epochs=5,
    callbacks=[tensorboard_callback]
)
After running this, launch TensorBoard and navigate to the "Profile" tab. You will find several tools:
- Overview page: a high-level summary of performance, including average step time.
- Input-pipeline analyzer: examines your tf.data pipeline to find bottlenecks in data loading and preprocessing.
- Trace viewer: a timeline of the operations executed on the CPU and GPU.
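If you need to capture a profile outside of model.fit, for example around a custom training loop, TensorFlow also exposes a programmatic API. A minimal sketch, assuming the same log_dir as above and a hypothetical train_step function and train_iterator:
tf.profiler.experimental.start(log_dir)    # begin writing profile data to log_dir
for step in range(10, 20):                 # profile a handful of steps
    train_step(next(train_iterator))       # hypothetical training step and data iterator
tf.profiler.experimental.stop()            # finish and flush the profile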
The trace viewer, available in both the PyTorch and TensorFlow profilers, is important for understanding the interaction between the CPU and GPU. A common performance issue is a data loading pipeline that cannot keep up with the GPU. The trace viewer makes this problem obvious.
A performance trace showing an input-bound workload: the GPU finishes its work and then enters an idle state, waiting for the CPU to load, preprocess, and transfer the next batch of data. These gaps in the GPU timeline represent wasted compute capacity.
In this diagram, the GPU executes a batch of work quickly and then sits idle. During this idle time, the CPU is busy preparing the next batch. The goal of optimization is to pipeline these activities so that the CPU is always preparing a future batch while the GPU is working on the current one, thus eliminating the "IDLE" gaps on the GPU timeline. By analyzing a real trace from your profiler, you can measure the duration of these gaps and confirm that the input pipeline is indeed your bottleneck.
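In tf.data, this overlap is usually achieved with parallel preprocessing and prefetching; in PyTorch, the equivalent levers are the DataLoader's num_workers and pin_memory options. A minimal tf.data sketch, assuming a train_dataset and a preprocess function of your own:
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

dataset = (
    train_dataset                                   # hypothetical source dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)   # run preprocessing on multiple CPU threads
    .batch(16)
    .prefetch(AUTOTUNE)                             # prepare future batches while the GPU computes
)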