Advanced Compiler and Runtime Optimizations for ML Workloads
Chapter 1: Foundation: ML Execution Stack and Challenges
The ML Model Deployment Gap
Overview of ML Compiler and Runtime Stacks
Performance Bottlenecks in ML Inference
Hardware for ML Acceleration
The Need for Specialized Optimizations
Chapter 2: Advanced Intermediate Representations for ML
Limitations of Traditional Compiler IRs
Principles of Multi-Level IRs
MLIR: Dialects and Operations
Representing High-Level ML Graphs (e.g., TF, TOSA)
Lowering Paths within MLIR
Extensibility and Custom Dialects
Hands-on Practical: Analyzing MLIR Representations
Chapter 3: Advanced Graph-Level Optimizations
Graph Rewriting Systems
Aggressive Operator Fusion Techniques
Memory-Aware Layout Transformations
Advanced Algebraic Simplification
Static Memory Planning and Allocation
Handling Control Flow in Graphs
Hands-on Practical: Implementing a Fusion Pass
Chapter 4: Tensor-Level Optimizations and Polyhedral Modeling
Representing Tensor Computations as Loop Nests
Introduction to Polyhedral Modeling
Iteration Domains, Access Functions, and Dependencies
Scheduling Transformations (Skewing, Tiling)
Code Generation from Polyhedral Schedules
Auto-Vectorization Techniques (SIMD)
Memory Hierarchy Optimization: Tiling and Prefetching
Hands-on Practical: Optimizing Loops with Polyhedral Tools
Chapter 5: Code Generation for Heterogeneous Hardware
Target-Specific Instruction Selection
Register Allocation for Vector/Matrix Units
GPU Code Generation: CUDA and ROCm Backends
Generating Code for Tensor Cores and Matrix Units
Targeting AI Accelerators (TPUs, NPUs)
Intermediate Formats for Heterogeneous Execution (SPIR-V)
Vendor-Specific Compiler Toolchains and Libraries (cuDNN, MIOpen)
Hands-on Practical: Analyzing Generated GPU Kernels
Chapter 6: Advanced Runtime Systems for ML
Runtime Architecture Overview
Handling Dynamic Shapes and Sizes
Efficient Memory Management Strategies
Asynchronous Execution and Scheduling
Scheduling for Heterogeneous Systems
Integrating Custom Operators and Kernels
Interoperability with ML Frameworks
Hands-on Practical: Implementing a Simple Allocator
Chapter 7: Just-In-Time (JIT) Compilation Techniques for ML
Motivation for JIT Compilation in ML
Tracing vs. Scripting Approaches
Intermediate Representation in JIT Systems
Runtime Specialization and Polymorphism
Profile-Guided Optimization (PGO) in JITs
Adaptive and Multi-Tier Compilation
Case Study: TensorFlow XLA
Case Study: PyTorch JIT (TorchScript)
Hands-on Practical: Analyzing JIT Compiled Code
Chapter 8: Quantization and Low-Precision Optimizations
Fundamentals of Model Quantization (INT8, FP8)
Representing Quantized Operations in IR
Compiler Passes for Quantization-Aware Training (QAT)
Post-Training Quantization (PTQ) Compilation Flows
Generating Low-Precision Kernels
Mixed-Precision Computation Optimization
Handling Quantization Scales and Zero Points
Hands-on Practical: Lowering Quantized Operations
Chapter 9: Profiling and Performance Analysis Tools
Challenges in Profiling Compiled ML Code
System-Level Profiling (CPU, GPU, Interconnect)
CPU Performance Analysis (VTune, perf)
GPU Kernel Profiling (Nsight Compute, ROCprof)
Correlating Framework Operations to Compiled Kernels
Memory Access Pattern Analysis
Interpreting Compiler Optimization Reports
Hands-on Practical: Profiling an Optimized Model