Large language models (LLMs) offer impressive capabilities, but their computational demands present significant challenges for deployment. This chapter addresses the 'why' behind the need for optimization. We'll examine the relationship between model size and resource requirements, often described by scaling laws such as $\text{Performance} \propto N^{\alpha}$, where $N$ represents the parameter count or dataset size and $\alpha$ is an empirically fitted exponent.
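To make the power-law idea concrete, here is a minimal sketch of how loss might be estimated as a function of parameter count. The functional form $L(N) = (N_c / N)^{\alpha}$ follows published scaling-law studies, but the constants `N_C` and `ALPHA` below should be read as illustrative placeholders, not authoritative fits.

```python
# Illustrative scaling-law sketch: loss as a power law in parameter count.
# The constants are placeholders chosen to produce plausible-looking numbers.
N_C = 8.8e13    # assumed scale constant
ALPHA = 0.076   # assumed exponent

def loss_estimate(n_params: float) -> float:
    """Estimate loss via L(N) = (N_c / N)^alpha."""
    return (N_C / n_params) ** ALPHA

for n in (1e9, 7e9, 70e9):
    print(f"N = {n:.0e} params -> estimated loss {loss_estimate(n):.3f}")
```

Running this shows the characteristic diminishing returns: each 10x increase in parameters shaves off a progressively smaller slice of loss, which is exactly why raw scale becomes so expensive to buy.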
You will learn to identify key bottlenecks during inference, particularly memory bandwidth limitations and compute constraints. We will analyze how specific components of the Transformer architecture influence efficiency and introduce standard metrics used to evaluate model compression and speed. Furthermore, we'll survey the common hardware platforms (CPUs, GPUs, TPUs) used for LLMs and touch upon the theoretical trade-offs inherent in optimization efforts. By the end of this chapter, you'll have a solid understanding of the fundamental efficiency problems that motivate the techniques covered in the rest of this course.
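As a preview of the bottleneck analysis in this chapter, the back-of-the-envelope sketch below compares the memory-streaming time and the arithmetic time for decoding a single token. The hardware figures (bandwidth, peak FLOP/s) are assumptions for a generic accelerator, not a specific product specification.

```python
# Why single-stream decoding is usually memory-bandwidth bound:
# each generated token must read every weight once, but performs
# only ~2 FLOPs per weight, so the math finishes long before the
# weights can be streamed from memory.
N_PARAMS = 7e9           # model size (assumed)
BYTES_PER_PARAM = 2      # fp16/bf16 weights
MEM_BANDWIDTH = 2e12     # bytes/s (assumed accelerator figure)
PEAK_FLOPS = 300e12      # FLOP/s (assumed accelerator figure)

weight_bytes = N_PARAMS * BYTES_PER_PARAM
t_memory = weight_bytes / MEM_BANDWIDTH     # time to stream the weights once
t_compute = (2 * N_PARAMS) / PEAK_FLOPS     # time to do the matmul arithmetic

print(f"memory-bound floor : {t_memory * 1e3:.2f} ms/token")
print(f"compute-bound floor: {t_compute * 1e3:.3f} ms/token")
```

With these assumed numbers the memory floor is roughly two orders of magnitude above the compute floor, which is the gap that quantization, batching, and the other techniques in this course aim to close.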
1.1 Scaling Laws and Computational Costs of LLMs
1.2 Memory Bandwidth and Compute Bottlenecks in LLM Inference
1.3 Architectural Considerations for Efficiency
1.4 Metrics for Evaluating LLM Compression and Latency
1.5 Hardware Landscape for LLM Deployment
1.6 Theoretical Limits of Compression and Acceleration