After developing and training complex PyTorch models, the focus shifts to preparing them for practical use. This chapter addresses model deployment and performance optimization, providing methods to make your models faster, smaller, and more resource-efficient during inference.
We will cover model serialization using TorchScript, exploring both tracing and scripting approaches. You'll learn model compression techniques, including quantization (static, dynamic, and quantization-aware training) and pruning strategies, to reduce model size and computational requirements. You'll use the PyTorch Profiler to identify performance bottlenecks in CPU and GPU execution. You'll also learn to export models to the ONNX format for broader compatibility and to serve models efficiently using TorchServe.
By the end of this chapter, you'll have gained practical skills in analyzing model performance and applying various optimization techniques essential for transitioning PyTorch models from development to production environments.
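To give a sense of the workflow ahead, here is a minimal, hypothetical sketch that touches a few of these topics on a toy model; the model and file names are placeholders, and each call is explained in detail in the sections listed below.

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network (hypothetical example).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
example_input = torch.randn(1, 128)

# Serialize with TorchScript tracing (Section 4.1).
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")

# Apply dynamic quantization to the Linear layers (Section 4.2).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Export to ONNX for use with other runtimes (Section 4.6).
torch.onnx.export(model, example_input, "model.onnx")
```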
4.1 TorchScript Fundamentals: Tracing vs Scripting
4.2 Model Quantization Techniques
4.3 Model Pruning Strategies
4.4 Performance Analysis with PyTorch Profiler
4.5 Optimizing Kernels with External Libraries
4.6 Exporting Models to ONNX Format
4.7 Serving Models with TorchServe
4.8 Practice: Profiling and Quantizing a Model