Evaluating the performance of an acoustic model is a critical step. While manually inspecting a few transcriptions might seem sufficient, this approach is subjective and does not scale. Objective, reproducible metrics are essential for properly benchmarking speech recognition systems. Word Error Rate (WER) and Character Error Rate (CER) are two standard metrics used in speech recognition. These metrics quantify the difference between the model's predicted transcription (the hypothesis) and the correct, human-verified transcription (the reference).
Word Error Rate is the most common evaluation metric for ASR systems. It measures the number of errors at the word level, which aligns well with how humans perceive transcription accuracy for most languages. The calculation is based on the Levenshtein distance, which finds the minimum number of edits needed to change one sequence into another. For WER, these edits are:
- Substitutions ($S$): a word is replaced by a different word (weather becomes feather).
- Deletions ($D$): a word from the reference is missing in the hypothesis (turn left becomes turn).
- Insertions ($I$): an extra word appears in the hypothesis (go becomes uh go).

These counts are then normalized by the total number of words in the reference transcript ($N$) to calculate the final WER. As the chapter introduction noted, the formula is:

$$\text{WER} = \frac{S + D + I}{N}$$
A lower WER indicates a better-performing model, with a WER of 0 indicating a perfect transcription.
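The alignment itself is a Levenshtein (minimum edit distance) computation followed by a traceback that attributes each edit to $S$, $D$, or $I$. The sketch below was written for this section rather than taken from any library, and the names align_counts and word_error_rate are illustrative; it assumes whitespace-tokenized text.

# Minimal sketch of WER: dynamic-programming alignment plus a traceback
# that counts substitutions, deletions, and insertions.

def align_counts(ref, hyp):
    """Return (S, D, I) for a minimum-edit alignment of two token lists."""
    # dp[i][j] is the minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # delete i reference tokens
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # insert j hypothesis tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]            # tokens match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    # Walk back through the table and attribute each edit to S, D, or I
    i, j, s, d, ins = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                        # correct token
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s, i, j = s + 1, i - 1, j - 1              # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d, i = d + 1, i - 1                        # deletion
        else:
            ins, j = ins + 1, j - 1                    # insertion
    return s, d, ins

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N for whitespace-tokenized strings."""
    s, d, i = align_counts(reference.split(), hypothesis.split())
    return (s + d + i) / len(reference.split())

Because align_counts accepts any token list, the same function will serve for character-level error rates later in this section.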
Let's consider a practical example.
Reference: SHOW ME THE WEATHER
Hypothesis: SHOW THE WEATHER NOW

To calculate the errors, we need to align the two sentences to find the minimum number of edits.
- SHOW -> SHOW (correct)
- ME -> (deletion)
- THE -> THE (correct)
- WEATHER -> WEATHER (correct)
- -> NOW (insertion)

In this case, we have $S = 0$, $D = 1$, and $I = 1$. The total number of words in the reference, $N$, is 4, so:

$$\text{WER} = \frac{0 + 1 + 1}{4} = 0.5 = 50\%$$
The following diagram illustrates this alignment process and the corresponding errors.
An alignment between a reference and hypothesis text, showing a deletion and an insertion. The total error count is 2, leading to a WER of 50%.
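Running the align_counts sketch from above on this pair reproduces the same breakdown:

reference = "show me the weather"
hypothesis = "show the weather now"
print(align_counts(reference.split(), hypothesis.split()))  # (0, 1, 1) -> S, D, I
print(word_error_rate(reference, hypothesis))               # 0.5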
While WER is the default metric, it is less suitable for languages that are not whitespace-segmented, such as Mandarin or Japanese. It can also be misleading for tasks where individual character accuracy is significant, like transcribing proper nouns or alphanumeric codes. For these situations, Character Error Rate (CER) is a better choice.
The calculation is identical in principle to WER, but it operates on characters instead of words:

$$\text{CER} = \frac{S + D + I}{N}$$

Here, $S$, $D$, and $I$ are substitutions, deletions, and insertions at the character level, and $N$ is the total number of characters in the reference text.
For example:
Reference: HELLO
Hypothesis: HALLOW

The alignment shows one substitution (E -> A) and one insertion (W), giving two errors over $N = 5$ reference characters, or a CER of 40%.
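The align_counts sketch from earlier handles this case as well; simply pass lists of characters instead of lists of words:

# Character-level alignment: the tokens are individual characters
s, d, i = align_counts(list("hello"), list("hallow"))
print(s, d, i)                      # 1 0 1 (E -> A substitution, W insertion)
print((s + d + i) / len("hello"))   # 0.4, i.e. a CER of 40%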
A common point of confusion is that WER can exceed 100%. This happens when the number of errors, particularly insertions, is greater than the number of words in the reference. Imagine a model that produces a very long, incorrect transcription for a short audio clip.
Reference: GO
Hypothesis: PLEASE NO DON'T GO

Here the hypothesis adds three words to a one-word reference, so $\text{WER} = \frac{3}{1} = 300\%$.

It is also important to remember that WER and CER are purely lexical metrics. They do not account for semantic meaning. A single substitution can drastically change the intent of a sentence, yet the WER penalty is the same as for a trivial error:
- turn left -> turn right (1 substitution, massive semantic error)
- turn left -> turn uh left (1 insertion, minor semantic error)

Despite these limitations, WER remains the universal standard for comparing ASR systems because it is simple, easy to compute, and provides a reliable benchmark for overall performance.
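Both behaviors, a WER above 100% and the identical penalty for very different mistakes, are easy to confirm with the word_error_rate sketch from earlier:

print(word_error_rate("go", "please no don't go"))   # 3.0, a WER of 300%
print(word_error_rate("turn left", "turn right"))    # 0.5, serious meaning change
print(word_error_rate("turn left", "turn uh left"))  # 0.5, harmless filler word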
You rarely need to implement the alignment algorithm yourself. Several Python libraries can compute these metrics efficiently. The jiwer library is a popular and easy-to-use option.
To use it, first install it with pip:
pip install jiwer
Then, you can use it to compute WER and get a detailed breakdown of the different error types.
import jiwer
# The reference and hypothesis strings
reference = "show me the weather"
hypothesis = "show the weather now"
# The jiwer.wer function gives the final WER score directly
error_rate = jiwer.wer(reference, hypothesis)
print(f"Word Error Rate: {error_rate:.2%}")
# The compute_measures function returns a dictionary with all counts
measures = jiwer.compute_measures(reference, hypothesis)
print(f"Substitutions: {measures['substitutions']}")
print(f"Deletions: {measures['deletions']}")
print(f"Insertions: {measures['insertions']}")
print(f"Reference words (N): {measures['truth']}")
Running this code produces the result we calculated manually:
Word Error Rate: 50.00%
Substitutions: 0
Deletions: 1
Insertions: 1
Reference words (N): 4
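Note that compute_measures has been deprecated in newer jiwer releases. If you are on jiwer 3.x, the process_words function exposes the same counts as attributes; the snippet below assumes that newer API.

# Assumes jiwer 3.x, where process_words supersedes compute_measures
output = jiwer.process_words(reference, hypothesis)
print(output.wer, output.substitutions, output.deletions, output.insertions)
# visualize_alignment returns a readable word-by-word alignment
print(jiwer.visualize_alignment(output))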
To calculate CER, you can reuse the same functions by treating each character as a token, for example by joining the characters of each string with spaces.
# Calculate CER by working with characters
reference_chars = " ".join(list("hello")) # "h e l l o"
hypothesis_chars = " ".join(list("hallow")) # "h a l l o w"
cer_measures = jiwer.compute_measures(reference_chars, hypothesis_chars)
print(f"\nCharacter Error Rate: {cer_measures['wer']:.2%}")