To optimize your system, you must first understand its behavior. Simply running a training job and waiting for it to finish provides no insight into why it took as long as it did. Profiling is the process of measuring the resource consumption and execution time of different parts of your application. It allows you to move from guessing about performance to making data-driven decisions. An underutilized GPU is a common problem, often caused by a CPU-bound data pipeline or inefficient operations. By using the right tools, you can pinpoint exactly where your system is spending its time and identify the primary bottlenecks to address.
Before exploring complex profiling libraries, a simple, real-time check of your system's resources can provide immediate clues. These command-line tools are the first step in any performance investigation.
htop
The htop utility is an interactive process viewer for Unix-like systems and a significant improvement over the traditional top command. It gives you a color-coded, real-time view of your CPU cores, memory usage, and a list of running processes sorted by resource consumption.
When you run your training script, open another terminal and run htop. Watch the CPU meters at the top. If you see one or more CPU cores consistently at 100% while your GPU sits idle, you likely have a CPU bottleneck. This often occurs during data preprocessing, where tasks like image augmentation or text tokenization are running on the CPU before the data is sent to the GPU.
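A quick way to confirm this suspicion from inside the training script is to time how long each iteration waits for data versus how long it spends computing. The following is a minimal sketch, assuming a PyTorch-style loop with hypothetical train_loader, model, criterion, and optimizer objects; if the data wait dominates, the input pipeline is the bottleneck.
import time
import torch

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()

for inputs, targets in train_loader:          # hypothetical DataLoader
    data_time += time.perf_counter() - end    # time spent waiting for the CPU to produce the batch

    start = time.perf_counter()
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                  # wait for queued GPU work so the timing is meaningful
    compute_time += time.perf_counter() - start

    end = time.perf_counter()

print(f"waiting for data: {data_time:.1f}s, compute: {compute_time:.1f}s")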
nvidia-smi
For any system with NVIDIA GPUs, the NVIDIA System Management Interface (nvidia-smi) is an indispensable tool. It provides a detailed summary of the status of all available GPUs. To get a continuous, updating view, run it with the watch command.
watch -n 1 nvidia-smi
This command refreshes the output every second. A typical output looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    59W / 400W |   1540MiB / 40960MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
The fields worth watching are GPU-Util, the percentage of time over the last sample period during which the GPU was executing kernels; Memory-Usage, the amount of device memory allocated out of the total available; and Pwr:Usage/Cap and Temp, which indicate how hard the card is working and whether it may be thermally throttling. If nvidia-smi shows low GPU-Util while your training job is running, your investigation should immediately turn to the data pipeline and CPU-bound operations.
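For longer runs it can be convenient to log utilization instead of watching it live. The sketch below is one way to do this, assuming nvidia-smi is on the PATH; it polls the query interface once per second and appends the samples to a CSV file (the file name, interval, and sample count are arbitrary choices).
import subprocess
import time

def log_gpu_utilization(out_path="gpu_log.csv", interval_s=1.0, samples=60):
    """Append GPU utilization and memory usage samples to a CSV file."""
    query = [
        "nvidia-smi",
        "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader",
    ]
    with open(out_path, "a") as f:
        for _ in range(samples):
            # Each call returns one CSV line per GPU, e.g. "... , 95 %, 1540 MiB, 40960 MiB"
            f.write(subprocess.check_output(query, text=True))
            time.sleep(interval_s)

log_gpu_utilization()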
While nvidia-smi tells you if your GPU is busy, it doesn't tell you what it's doing. To understand the performance of individual operations within your model, you need to use a profiler built into your machine learning framework. These tools can break down the execution time of every function, kernel launch, and memory copy.
PyTorch includes a built-in profiler, torch.profiler, that is easy to integrate into your training script. You simply wrap the code you want to analyze, typically the training loop, in a context manager. The profiler can track both CPU and GPU activities.
Here is a basic example of how to use it:
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity
model = models.resnet18().cuda()
inputs = torch.randn(16, 3, 224, 224).cuda()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(inputs)

# Print a summary to the console
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# For a detailed timeline view, export a Chrome trace file
# prof.export_chrome_trace("trace.json")
The output of prof.key_averages().table() gives you a list of operations, showing the time spent on the CPU and GPU for each. This is extremely useful for finding which specific layers or functions are taking the most time. For an even more powerful view, you can export the results as a Chrome trace file or view them in TensorBoard. The trace view shows a timeline of operations, making it easy to spot gaps where the GPU is idle.
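To send the results to TensorBoard instead of a raw trace file, the profiler can write its output through torch.profiler.tensorboard_trace_handler, usually combined with a schedule so that only a few steps are recorded (viewing the timeline in TensorBoard requires the torch-tb-profiler plugin). A minimal sketch, assuming a hypothetical train_loader and train_step function:
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3 active steps
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
) as prof:
    for step, batch in enumerate(train_loader):   # hypothetical DataLoader
        train_step(batch)                          # hypothetical forward/backward/optimizer step
        prof.step()                                # tell the profiler a training step has finished
        if step >= 5:
            break

# Afterwards: tensorboard --logdir=./log/profiler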
TensorFlow's profiler is tightly integrated with TensorBoard. The easiest way to capture a profile is by using the TensorBoard callback during model training with tf.keras. You can configure it to profile specific batches.
import tensorflow as tf
# ... model and data setup ...
# Create a TensorBoard callback
log_dir = "logs/profile/"
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    profile_batch='10,20'  # Profile batches 10 through 20
)

model.fit(
    train_dataset,
    epochs=5,
    callbacks=[tensorboard_callback]
)
After running this, launch TensorBoard and navigate to the "Profile" tab. You will find several tools:
- Overview page: a high-level summary of performance, including average step time.
- Input-pipeline analyzer: examines your tf.data pipeline to find bottlenecks in data loading and preprocessing.
- Trace viewer: a timeline of the operations executed on the CPU and GPU.
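If you need to capture a profile outside of model.fit, for example around a custom training loop, TensorFlow also exposes a programmatic API. A minimal sketch, assuming the same log_dir as above and a hypothetical train_step function and train_iterator:
tf.profiler.experimental.start(log_dir)    # begin writing profile data to log_dir
for step in range(10, 20):                 # profile a handful of steps
    train_step(next(train_iterator))       # hypothetical training step and data iterator
tf.profiler.experimental.stop()            # finish and flush the profile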
The trace viewer, available in both the PyTorch and TensorFlow profilers, is important for understanding the interaction between the CPU and GPU. A common performance issue is a data loading pipeline that cannot keep up with the GPU. The trace viewer makes this problem obvious.
A performance trace showing an input-bound workload: the GPU finishes its work and then enters an idle state, waiting for the CPU to load, preprocess, and transfer the next batch of data. These gaps in the GPU timeline represent wasted compute capacity.
In this diagram, the GPU executes a batch of work quickly and then sits idle. During this idle time, the CPU is busy preparing the next batch. The goal of optimization is to pipeline these activities so that the CPU is always preparing a future batch while the GPU is working on the current one, thus eliminating the "IDLE" gaps on the GPU timeline. By analyzing a real trace from your profiler, you can measure the duration of these gaps and confirm that the input pipeline is indeed your bottleneck.
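In tf.data, this overlap is usually achieved with parallel preprocessing and prefetching; in PyTorch, the equivalent levers are the DataLoader's num_workers and pin_memory options. A minimal tf.data sketch, assuming a train_dataset and a preprocess function of your own:
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

dataset = (
    train_dataset                                   # hypothetical source dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)   # run preprocessing on multiple CPU threads
    .batch(16)
    .prefetch(AUTOTUNE)                             # prepare future batches while the GPU computes
)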