Moving a large language model from training to production requires deploying it and serving inference requests efficiently. This chapter concentrates on the operational aspects of this transition, focusing on the scale and resource demands unique to LLMs.
You will examine techniques for:
- Packaging and containerizing LLMs for deployment
- Optimizing GPU inference servers
- Reducing model size and cost through quantization and knowledge distillation
- Rolling out changes safely with canary releases and A/B testing
- Autoscaling inference endpoints, including serverless GPU options

The objective is to provide practical approaches for building performant, scalable, and cost-aware LLM serving systems.
4.1 Challenges in Serving Large Models
4.2 Model Packaging and Containerization for LLMs
4.3 GPU Inference Server Optimization
4.4 Implementing Model Quantization Techniques
4.5 Knowledge Distillation for Deployment
4.6 Advanced Deployment Patterns (Canary, A/B Testing)
4.7 Autoscaling Inference Endpoints
4.8 Serverless GPU Inference Considerations
4.9 Practice: Deploying a Quantized Model