While GPUs and specialized accelerators often handle the bulk of the computation in ML workloads, the CPU remains a significant component. It executes parts of the model graph, drives the accelerator, manages the runtime system, handles data loading and preprocessing, and runs control flow logic. Optimizing CPU performance is therefore essential for overall application speed. This section details how to use specialized CPU profiling tools, specifically Intel VTune Profiler and Linux perf, to dissect the performance of compiled ML code running on the CPU.
As highlighted in the chapter introduction, profiling code transformed by ML compilers presents unique challenges. Function names might be mangled or correspond to large, fused kernels, making direct correlation to the original model difficult. Furthermore, the heavy use of libraries (like MKL-DNN/oneDNN, OpenBLAS) means performance often depends on these pre-optimized routines. Effective CPU profiling requires tools that can look past the source code level and analyze the actual hardware execution details.
Intel VTune Profiler
Intel VTune Profiler is a powerful performance analysis tool for understanding CPU (and GPU/FPGA) behavior. It provides a graphical interface and multiple analysis types suitable for deep-diving into ML workloads.
You typically run VTune by either launching your ML application under its control or attaching it to an already running process. VTune collects data and presents it in various views (e.g., summary, bottom-up, caller/callee). You can often drill down from function-level summaries to source code or assembly, although correlating highly optimized, generated code back to the original ML graph operation requires understanding the compiler's transformations. Look for annotations showing PMU event counts directly on assembly instructions for fine-grained analysis.
Figure: Example breakdown of CPU time from a VTune Microarchitecture Exploration, showing potential bottlenecks like memory stalls (L1/L2/L3 Bound) or instruction starvation (Frontend Bound) versus useful work (Retiring). Here, KernelA_Fused shows significant time potentially stalled on memory access.
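VTune also ships a command-line interface, which is convenient for remote machines and automation. A minimal sketch, assuming the oneAPI vtune binary is on your PATH and a hypothetical inference script; analysis-type names can differ between VTune versions:
# Collect a Microarchitecture Exploration result for an inference run
vtune -collect uarch-exploration -result-dir vtune_uarch -- python run_inference.py --model compiled_model.bin
# Print a text summary of the collected result
vtune -report summary -result-dir vtune_uarch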
Linux perf
Linux perf is a powerful, versatile command-line profiling tool built into the Linux kernel. It leverages the CPU's Performance Monitoring Units (PMUs) to sample or count hardware events with low overhead.
perf Usage
perf stat: Provides aggregate counts for common hardware events (cycles, instructions, cache misses, branch misses) for a command or process ID. Useful for a quick overview of performance characteristics.
# Count events for the entire execution of an ML inference script
perf stat python run_inference.py --model compiled_model.bin
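Beyond the default event set, the -e flag restricts counting to events you care about; a sketch reusing the same hypothetical script:
# Count only the events relevant to a suspected cache problem
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses \
    python run_inference.py --model compiled_model.bin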
perf record: Samples the program's execution. It periodically records the instruction pointer and other information (like the call stack, using the -g option). This generates a perf.data file.
# Record performance data with call graphs
perf record -g ./my_compiled_ml_app --input data.npy
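Compiler-generated code is frequently built without frame pointers, which can leave the default call graphs truncated. DWARF-based unwinding is heavier but more robust; a sketch using the same hypothetical binary:
# DWARF call-graph unwinding at roughly 99 samples per second
perf record --call-graph dwarf -F 99 ./my_compiled_ml_app --input data.npy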
perf report: Analyzes the perf.data file generated by perf record. It presents a hierarchical view showing the percentage of samples collected in each function, library, or even individual instruction. You can navigate this report interactively in the terminal to explore hotspots and call chains.
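If you need a non-interactive view, for example to capture output in a log, the same data can be printed as text; a small sketch using standard perf report flags:
# Plain-text report, grouped by shared object and then symbol
perf report --stdio --sort dso,symbol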
perf annotate: Disassembles functions identified as hotspots by perf report and annotates the assembly code with the percentage of samples that occurred at each instruction. This is invaluable for pinpointing the exact instructions causing stalls or consuming cycles, especially within compiler-generated code.
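For example, to annotate one hotspot directly, pass its symbol name as reported by perf report (here the fused-kernel name from the VTune figure is reused as a stand-in; real generated names may be mangled):
# Show per-instruction sample percentages for one hot function
perf annotate --stdio KernelA_Fused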
perf Strengths and Weaknesses
perf's main strengths are its availability (it is part of the Linux kernel), its low overhead, and its scriptability, which make it well suited to servers and automated pipelines where a GUI is impractical. Its main weaknesses relative to VTune are the terminal-only interface and the absence of guided, high-level analyses: you must choose the hardware events to collect and interpret the raw numbers yourself.
perf for Compiled ML
Similar to VTune, you use perf record to capture data while your compiled model executes. perf report will often show hotspots within the ML runtime's execution engine, specific kernels generated by the ML compiler, or within vendor libraries like oneDNN or OpenBLAS. Using perf annotate on these hotspot functions allows you to examine the assembly code and see where hardware events (like cache misses recorded via perf record -e cache-misses ...) are occurring frequently. This can directly inform compiler development or tuning. For instance, seeing numerous cache misses on load instructions within a generated loop nest might suggest that the compiler's tiling or prefetching strategy needs adjustment.
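The workflow below sketches that investigation end to end, reusing the hypothetical binary and kernel name from earlier. Note that if the runtime JIT-compiles kernels at load time, perf needs extra help to resolve their symbols, for example a /tmp/perf-<pid>.map file or jitdump support via perf inject --jit.
# 1. Sample cache misses with call stacks during a compiled-model run
perf record -e cache-misses -g ./my_compiled_ml_app --input data.npy
# 2. Rank functions by their share of sampled cache misses
perf report --stdio
# 3. Drill into the hottest generated kernel's assembly
perf annotate --stdio KernelA_Fused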
Both VTune and perf provide data; the skill lies in interpreting it within the context of ML compilation.
By systematically applying these CPU profiling tools, you can move beyond guesswork and obtain concrete data about how your compiled ML code interacts with the hardware. This data is essential for diagnosing performance limitations and guiding the sophisticated compiler and runtime optimizations discussed throughout this course.