Optimizing a large language model through techniques like quantization or pruning addresses model size and computational complexity. However, achieving significant inference speedups also requires that the model's computations map efficiently onto the underlying hardware and system software. This chapter focuses on methods for accelerating LLM execution by tailoring computations to specific hardware capabilities and optimizing the surrounding system components.
Understanding these hardware and systems-level optimizations helps bridge the gap between a theoretically compressed model and a practically fast deployment.
You will learn about:
6.1 Mapping LLM Operations to Hardware Architectures
6.2 Memory Management Techniques for Large Models
6.3 Optimized Kernels for LLM Layers
6.4 Compiler Optimizations for LLMs
6.5 Distributed Inference Strategies
6.6 Advanced Inference Optimization Algorithms
6.7 Benchmarking LLM Performance on Diverse Hardware
6.8 Hands-on Practical: Optimizing Inference with Runtimes