After adapting a Large Language Model to a specific task, the next step is to make it practical and efficient for real-world use. This chapter addresses the steps of optimizing the model and preparing it for deployment.
You will examine methods for improving resource utilization during the fine-tuning process itself, including memory-saving techniques such as gradient accumulation and mixed-precision training (for example, computing in fp16 instead of fp32). You will also study strategies for accelerating training with distributed computing setups.
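As a preview, the sketch below combines gradient accumulation with mixed-precision training in a PyTorch loop. It is a minimal illustration, not the chapter's reference implementation: the names model, train_loader, loss_fn, and optimizer are assumed placeholders for objects defined elsewhere in your training script.

```python
import torch

# Placeholders assumed to exist: model, train_loader, loss_fn, optimizer.
accumulation_steps = 4                   # simulate a batch 4x larger than what fits in memory
scaler = torch.cuda.amp.GradScaler()     # scales the loss to avoid fp16 gradient underflow

model.train()
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    # Run the forward pass in mixed precision: fp16 where safe, fp32 elsewhere.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, labels) / accumulation_steps

    # Accumulate scaled gradients across micro-batches.
    scaler.scale(loss).backward()

    # Update the weights only once every `accumulation_steps` micro-batches.
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```

Dividing the loss by accumulation_steps keeps the effective gradient magnitude equivalent to a single large batch, while the GradScaler handles the reduced numeric range of fp16.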
Following training, you will study post-tuning optimization methods such as quantization (reducing weight precision, potentially to int8 or lower) and pruning, both of which decrease model size and inference latency. The chapter also details how to package the model, merge PEFT adapters back into the base model weights, select inference serving frameworks designed for LLMs, and establish monitoring practices for deployed models.
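As an example of the packaging steps covered later, the sketch below merges a LoRA adapter into its base model using the Hugging Face peft library. The model identifier and adapter path are hypothetical placeholders; substitute your own checkpoints.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical identifiers; replace with your base model and adapter checkpoint.
base_id = "base-model-name"
adapter_dir = "path/to/lora-adapter"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the adapter weights into the base model so inference
# no longer requires the PEFT runtime or a separate adapter file.
merged = model.merge_and_unload()

# Save the standalone model and tokenizer for packaging and serving.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```

The merged checkpoint behaves like an ordinary Transformers model, which simplifies quantization, serialization, and loading into the serving frameworks discussed in the sections below.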
7.1 Memory Optimization during Training
7.2 Accelerating Training with Distributed Strategies
7.3 Post-tuning Optimization: Quantization
7.4 Post-tuning Optimization: Pruning
7.5 Merging PEFT Adapters
7.6 Model Serialization and Packaging
7.7 Inference Serving Frameworks
7.8 Monitoring Fine-tuned Models in Production