This hands-on session applies the principles for evaluating ASR systems and uses common tooling to build a simple application, rounding out a complete speech recognition project. First, we calculate the Word Error Rate (WER) for a set of predictions to quantitatively measure performance. Then we use the Gradio library to build a functional, interactive web demo for the speech recognition model.
Before deploying a model, it is important to have a clear, quantitative understanding of its accuracy. As we discussed, Word Error Rate (WER) is the industry standard for this. Let's calculate it using a Python library designed for this task.
First, you will need to install the jiwer library, a popular tool for calculating WER and other speech-to-text metrics.
pip install jiwer
Once installed, we can use it to compare a list of ground truth transcriptions with a list of hypotheses (predictions) generated by our model. For this example, let's assume we have already run our model on a test set and saved the results into two lists.
import jiwer

# Ground truth sentences from the test set
ground_truth = [
    "the quick brown fox jumps over the lazy dog",
    "this is a sample transcription",
    "speech recognition can be challenging",
]

# Predictions generated by our ASR model
predictions = [
    "the quick brown fox jumped over the lazy dog",  # 1 substitution
    "this a sample transcription",                   # 1 deletion
    "speech recognition can be challenging to do",   # 2 insertions
]

# Calculate the WER and other metrics.
# (process_words is the jiwer 3.x API; it replaces the older compute_measures.)
output = jiwer.process_words(ground_truth, predictions)

print(f"Word Error Rate (WER): {output.wer:.2f}")
print(f"Substitutions: {output.substitutions}")
print(f"Deletions: {output.deletions}")
print(f"Insertions: {output.insertions}")
Running this code will produce the following output, giving you a clear breakdown of the errors.
Word Error Rate (WER): 0.21
Substitutions: 1
Deletions: 1
Insertions: 2
The WER is calculated using the formula we saw earlier:
$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 2}{19} \approx 0.21$$

Here, S=1 ("jumps" -> "jumped"), D=1 ("is"), I=2 ("to", "do"), and N, the total number of words in the reference transcriptions, is 9 + 5 + 5 = 19.
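Note that jiwer pools errors across all reference words to produce a single corpus-level score, which can differ from the average of per-sentence scores. If you want to see how individual utterances fare, you can also score them one at a time; a minimal sketch reusing the lists above:

for ref, hyp in zip(ground_truth, predictions):
    # jiwer.wer also accepts single strings, giving a per-utterance score
    print(f"WER {jiwer.wer(ref, hyp):.2f}  |  {hyp}")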
The breakdown of errors is often as informative as the final WER score. A high number of substitutions might point to acoustic ambiguity, while a high number of insertions or deletions could suggest issues with the language model or the CTC decoder's behavior.
The count of each error type (substitutions, deletions, and insertions) that contributes to the overall Word Error Rate.
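To see exactly where each substitution, deletion, and insertion occurs, recent versions of jiwer also include an alignment visualizer. A quick diagnostic sketch, reusing the output object from the earlier snippet:

# Print an aligned view of references vs. hypotheses, marking each
# substitution (S), deletion (D), and insertion (I) inline.
print(jiwer.visualize_alignment(output))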
Calculating metrics is essential, but a live demo lets you interact with your model and show it to others. We will use the Gradio library to create a simple web interface for our ASR model with very little code.
First, ensure you have Gradio installed. You will also need a library for inference, such as transformers, along with a backend like torch.
pip install gradio transformers torch
Next, we will structure our application. The process follows a simple pattern:
1. Load the pre-trained ASR model into a transformers pipeline.
2. Define a Python function that takes audio as input and returns the transcribed text.
3. Wrap the function in a Gradio Interface and launch it.
The diagram below illustrates this workflow. The user interacts with the Gradio UI, which captures audio. This audio is passed to our Python function, which uses the model to perform inference. The resulting text is then sent back to the UI for display.
The flow of data from user audio input through the Gradio interface to the ASR model and back to the user as transcribed text.
Here is the complete code to create and launch the application. We will use a pre-trained model from the Hugging Face Hub, openai/whisper-tiny, which is small and works well for demonstration purposes.
import gradio as gr
from transformers import pipeline

# 1. Load the ASR pipeline from Hugging Face
# This will download the model on the first run
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# 2. Define the inference function
def transcribe_audio(audio):
    """
    Takes an audio file path (from Gradio input) and returns the transcribed text.
    Gradio handles the audio data format automatically.
    """
    if audio is None:
        return "No audio recorded. Please record your voice."
    # The pipeline returns a dictionary; we extract the text.
    result = asr_pipeline(audio)
    return result["text"]

# 3. Create and launch the Gradio interface
# (Gradio 4.x uses sources=[...]; older releases used source="microphone".)
demo = gr.Interface(
    fn=transcribe_audio,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription"),
    title="Applied Speech Recognition Demo",
    description="Record audio from your microphone and see the live transcription from the ASR model.",
)

# Launch the app! A link to a local web server will be printed.
demo.launch()
When you run this Python script, Gradio will start a local web server and print a URL (e.g., http://127.0.0.1:7860). Open this link in your browser to see your ASR application live. You can grant the browser microphone access, record a short clip, and see the model's transcription appear in the output box.
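By default the app is only reachable from your own machine. If you want to show the demo to someone else, Gradio can generate a temporary public link; a minimal variation on the last line:

# share=True tunnels the app through Gradio's servers and prints a
# temporary public URL alongside the local one (such links expire).
demo.launch(share=True)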
With just a few lines of code, you have successfully built and deployed a functional speech-to-text application. This concludes the primary workflow of an ASR project, from data processing and model training to final evaluation and deployment. You now possess the foundational skills to build, test, and demonstrate modern speech recognition systems.