Building high-performance ASR and TTS models is only part of the work. Getting those models to run efficiently in production applications presents its own set of technical problems. This chapter focuses on bridging the gap between model development and real-world deployment.
We begin with model optimization techniques, including quantization, pruning, and knowledge distillation, which reduce a model's computational cost (FLOPs) and memory footprint. We then turn to deployment strategies, covering optimized runtimes such as ONNX Runtime and TensorRT and the specific requirements of streaming ASR and low-latency TTS. The chapter closes with an overview of popular speech processing toolkits (ESPnet, NeMo, Coqui) to guide your implementation. A small preview of what these optimizations look like in practice follows below.
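To make the first of these techniques concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. `TinyEncoder` is a hypothetical stand-in for a real ASR encoder, not a model from this chapter; the same `quantize_dynamic` call applies to the linear layers of much larger networks.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for an ASR encoder block."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(torch.relu(self.proj(x)))

model = TinyEncoder().eval()

# Post-training dynamic quantization: weights of the listed module
# types are stored as int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Run inference on a dummy (batch, frames, features) input.
x = torch.randn(1, 100, 256)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 100, 256])
```

Dynamic quantization is often the lowest-effort starting point because it requires no calibration data; Section 6.1 examines quantization for speech models in more depth.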
6.1 Model Quantization for Speech Models
6.2 Model Pruning and Sparsification
6.3 Knowledge Distillation for ASR/TTS
6.4 Optimized Inference Engines (ONNX Runtime, TensorRT)
6.5 Deployment Considerations for Streaming ASR
6.6 Deployment Considerations for Real-Time TTS
6.7 Overview of Speech Processing Toolkits (ESPnet, NeMo, Coqui)
6.8 Practice: Optimizing a Speech Model