This hands-on session applies the principles for evaluating ASR systems and uses common tooling to build a simple application, rounding out a complete speech recognition project. First, we calculate the Word Error Rate (WER) for a set of predictions to quantitatively measure performance. Then we use the Gradio library to build a functional, interactive web demo for the speech recognition model.
Before deploying a model, it is important to have a clear, quantitative understanding of its accuracy. As we discussed, Word Error Rate (WER) is the industry standard for this. Let's calculate it using a Python library designed for this task.
First, you will need to install the jiwer library, a popular tool for calculating WER and other speech-to-text metrics.
pip install jiwer
Once installed, we can use it to compare a list of ground truth transcriptions with a list of hypotheses (predictions) generated by our model. For this example, let's assume we have already run our model on a test set and saved the results into two lists.
import jiwer

# Ground truth sentences from the test set
ground_truth = [
    "the quick brown fox jumps over the lazy dog",
    "this is a sample transcription",
    "speech recognition can be challenging",
]

# Predictions generated by our ASR model
predictions = [
    "the quick brown fox jumped over the lazy dog",  # 1 substitution
    "this a sample transcription",                   # 1 deletion
    "speech recognition can be challenging to do",   # 2 insertions
]

# Calculate the WER and other metrics.
# (process_words is the jiwer 3.x API; it replaces the older compute_measures.)
output = jiwer.process_words(ground_truth, predictions)

print(f"Word Error Rate (WER): {output.wer:.2f}")
print(f"Substitutions: {output.substitutions}")
print(f"Deletions: {output.deletions}")
print(f"Insertions: {output.insertions}")
Running this code will produce the following output, giving you a clear breakdown of the errors.
Word Error Rate (WER): 0.21
Substitutions: 1
Deletions: 1
Insertions: 2
The WER is calculated using the formula we saw earlier:
$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{1 + 1 + 2}{19} \approx 0.21$$

Here, S=1 ("jumps" -> "jumped"), D=1 ("is"), I=2 ("to", "do"), and N, the total number of words in the reference transcriptions, is 9 + 5 + 5 = 19.
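Note that jiwer pools errors across all reference words to produce a single corpus-level score, which can differ from the average of per-sentence scores. If you want to see how individual utterances fare, you can also score them one at a time; a minimal sketch reusing the lists above:

for ref, hyp in zip(ground_truth, predictions):
    # jiwer.wer also accepts single strings, giving a per-utterance score
    print(f"WER {jiwer.wer(ref, hyp):.2f}  |  {hyp}")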
The breakdown of errors is often as informative as the final WER score. A high number of substitutions might point to acoustic ambiguity, while a high number of insertions or deletions could suggest issues with the language model or the CTC decoder's behavior.
The count of each error type (substitutions, deletions, and insertions) that contributes to the overall Word Error Rate.
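To see exactly where each substitution, deletion, and insertion occurs, recent versions of jiwer also include an alignment visualizer. A quick diagnostic sketch, reusing the output object from the earlier snippet:

# Print an aligned view of references vs. hypotheses, marking each
# substitution (S), deletion (D), and insertion (I) inline.
print(jiwer.visualize_alignment(output))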
Calculating metrics is essential, but a live demo lets you interact with your model and show it to others. We will use the Gradio library to create a simple web interface for our ASR model with very little code.
First, ensure you have Gradio installed. You will also need a library for inference, such as transformers, along with a backend like torch.
pip install gradio transformers torch
Next, we will structure our application. The process follows a simple pattern:
1. Load the pre-trained ASR model into a transformers pipeline.
2. Define a Python function that takes audio as input and returns the transcribed text.
3. Wrap the function in a Gradio Interface and launch it.
The diagram below illustrates this workflow. The user interacts with the Gradio UI, which captures audio. This audio is passed to our Python function, which uses the model to perform inference. The resulting text is then sent back to the UI for display.
The flow of data from user audio input through the Gradio interface to the ASR model and back to the user as transcribed text.
Here is the complete code to create and launch the application. We will use a pre-trained model from the Hugging Face Hub, openai/whisper-tiny, which is small and works well for demonstration purposes.
import gradio as gr
from transformers import pipeline

# 1. Load the ASR pipeline from Hugging Face
# This will download the model on the first run
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# 2. Define the inference function
def transcribe_audio(audio):
    """
    Takes an audio file path (from Gradio input) and returns the transcribed text.
    Gradio handles the audio data format automatically.
    """
    if audio is None:
        return "No audio recorded. Please record your voice."
    # The pipeline returns a dictionary; we extract the text.
    result = asr_pipeline(audio)
    return result["text"]

# 3. Create and launch the Gradio interface
# (Gradio 4.x uses sources=[...]; older releases used source="microphone".)
demo = gr.Interface(
    fn=transcribe_audio,
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Textbox(label="Transcription"),
    title="Applied Speech Recognition Demo",
    description="Record audio from your microphone and see the live transcription from the ASR model.",
)

# Launch the app! A link to a local web server will be printed.
demo.launch()
When you run this Python script, Gradio will start a local web server and print a URL (e.g., http://127.0.0.1:7860). Open this link in your browser to see your ASR application live. You can grant the browser microphone access, record a short clip, and see the model's transcription appear in the output box.
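By default the app is only reachable from your own machine. If you want to show the demo to someone else, Gradio can generate a temporary public link; a minimal variation on the last line:

# share=True tunnels the app through Gradio's servers and prints a
# temporary public URL alongside the local one (such links expire).
demo.launch(share=True)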
With just a few lines of code, you have successfully built and deployed a functional speech-to-text application. This concludes the primary workflow of an ASR project, from data processing and model training to final evaluation and deployment. You now possess the foundational skills to build, test, and demonstrate modern speech recognition systems.