Having constructed several acoustic models, the next steps are to measure their effectiveness and put them into practice. This chapter focuses on these final stages of the ASR development cycle: evaluation and deployment.
First, you will learn how to quantitatively assess an ASR system's performance. We will cover the industry-standard metrics, Word Error Rate (WER) and Character Error Rate (CER). You will see how WER is calculated from the number of substitutions (S), deletions (D), and insertions (I) relative to the total number of words (N) in a reference transcript:
WER = (S + D + I) / N

Following evaluation, we will look at a common method for improving model generalization: audio data augmentation. The chapter then transitions from theory to application. You will work with the Hugging Face pipeline for straightforward inference and then use the Gradio library to build an interactive web interface for your model. To conclude, we will discuss the architectural requirements for systems that process streaming audio.
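As a preview of the metric, the formula above can be computed with a standard word-level edit-distance (Levenshtein) alignment between the reference and the hypothesis. The sketch below is a minimal, dependency-free illustration; the chapter itself uses established evaluation tooling, and the sentences here are invented examples.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (S + D + I) / N via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Reference has N = 4 words; hypothesis has 1 substitution and 1 deletion.
print(word_error_rate("the cat sat down", "the hat sat"))  # 2/4 = 0.5
```

Applying the same computation to characters instead of words yields CER, which is often preferred for languages without clear word boundaries.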
6.1 Metrics for ASR Performance: WER and CER
6.2 Calculating Word Error Rate
6.3 Common Data Augmentation Techniques for Speech
6.4 Using Hugging Face Pipelines for ASR
6.5 Building a Speech-to-Text Application with Gradio
6.6 Considerations for Real-time Streaming ASR
6.7 Practice: Evaluating and Building a Demo Application
© 2026 ApX Machine Learning