Evaluating the performance of an acoustic model is a critical step. While manually inspecting a few transcriptions might seem sufficient, this approach is subjective and does not scale. Objective, reproducible metrics are essential for properly benchmarking speech recognition systems. Word Error Rate (WER) and Character Error Rate (CER) are two standard metrics used in speech recognition. These metrics quantify the difference between the model's predicted transcription (the hypothesis) and the correct, human-verified transcription (the reference).
Word Error Rate is the most common evaluation metric for ASR systems. It measures the number of errors at the word level, which aligns well with how humans perceive transcription accuracy for most languages. The calculation is based on the Levenshtein distance, which finds the minimum number of edits needed to change one sequence into another. For WER, these edits are:
- Substitutions ($S$): a word is replaced by a different word (weather becomes feather).
- Deletions ($D$): a word from the reference is missing in the hypothesis (turn left becomes turn).
- Insertions ($I$): an extra word appears in the hypothesis (go becomes uh go).

These counts are then normalized by the total number of words in the reference transcript ($N$) to calculate the final WER. As the chapter introduction noted, the formula is:

$$\text{WER} = \frac{S + D + I}{N}$$
A lower WER indicates a better-performing model, with a WER of 0 indicating a perfect transcription.
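The alignment itself is a Levenshtein (minimum edit distance) computation followed by a traceback that attributes each edit to $S$, $D$, or $I$. The sketch below was written for this section rather than taken from any library, and the names align_counts and word_error_rate are illustrative; it assumes whitespace-tokenized text.

# Minimal sketch of WER: dynamic-programming alignment plus a traceback
# that counts substitutions, deletions, and insertions.

def align_counts(ref, hyp):
    """Return (S, D, I) for a minimum-edit alignment of two token lists."""
    # dp[i][j] is the minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # delete i reference tokens
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # insert j hypothesis tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]            # tokens match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    # Walk back through the table and attribute each edit to S, D, or I
    i, j, s, d, ins = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                        # correct token
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s, i, j = s + 1, i - 1, j - 1              # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d, i = d + 1, i - 1                        # deletion
        else:
            ins, j = ins + 1, j - 1                    # insertion
    return s, d, ins

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N for whitespace-tokenized strings."""
    s, d, i = align_counts(reference.split(), hypothesis.split())
    return (s + d + i) / len(reference.split())

Because align_counts accepts any token list, the same function will serve for character-level error rates later in this section.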
Let's consider a practical example.
Reference: SHOW ME THE WEATHER
Hypothesis: SHOW THE WEATHER NOW

To calculate the errors, we need to align the two sentences to find the minimum number of edits.
- SHOW -> SHOW (correct)
- ME -> (deletion)
- THE -> THE (correct)
- WEATHER -> WEATHER (correct)
- -> NOW (insertion)

In this case, we have $S = 0$, $D = 1$, and $I = 1$. The total number of words in the reference, $N$, is 4, so:

$$\text{WER} = \frac{0 + 1 + 1}{4} = 0.5 = 50\%$$
The following diagram illustrates this alignment process and the corresponding errors.
An alignment between a reference and hypothesis text, showing a deletion and an insertion. The total error count is 2, leading to a WER of 50%.
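Running the align_counts sketch from above on this pair reproduces the same breakdown:

reference = "show me the weather"
hypothesis = "show the weather now"
print(align_counts(reference.split(), hypothesis.split()))  # (0, 1, 1) -> S, D, I
print(word_error_rate(reference, hypothesis))               # 0.5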
While WER is the default metric, it is less suitable for languages that are not whitespace-segmented, such as Mandarin or Japanese. It can also be misleading for tasks where individual character accuracy is significant, like transcribing proper nouns or alphanumeric codes. For these situations, Character Error Rate (CER) is a better choice.
The calculation is identical in principle to WER, but it operates on characters instead of words:

$$\text{CER} = \frac{S + D + I}{N}$$

Here, $S$, $D$, and $I$ are substitutions, deletions, and insertions at the character level, and $N$ is the total number of characters in the reference text.
For example:
Reference: HELLO
Hypothesis: HALLOW

The alignment shows one substitution (E -> A) and one insertion (W), giving two errors over $N = 5$ reference characters, or a CER of 40%.
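The align_counts sketch from earlier handles this case as well; simply pass lists of characters instead of lists of words:

# Character-level alignment: the tokens are individual characters
s, d, i = align_counts(list("hello"), list("hallow"))
print(s, d, i)                      # 1 0 1 (E -> A substitution, W insertion)
print((s + d + i) / len("hello"))   # 0.4, i.e. a CER of 40%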
A common point of confusion is that WER can exceed 100%. This happens when the number of errors, particularly insertions, is greater than the number of words in the reference. Imagine a model that produces a very long, incorrect transcription for a short audio clip.
Reference: GO
Hypothesis: PLEASE NO DON'T GO

Here the hypothesis adds three words to a one-word reference, so $\text{WER} = \frac{3}{1} = 300\%$.

It is also important to remember that WER and CER are purely lexical metrics. They do not account for semantic meaning. A single substitution can drastically change the intent of a sentence, yet the WER penalty is the same as for a trivial error:
- turn left -> turn right (1 substitution, massive semantic error)
- turn left -> turn uh left (1 insertion, minor semantic error)

Despite these limitations, WER remains the universal standard for comparing ASR systems because it is simple, easy to compute, and provides a reliable benchmark for overall performance.
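Both behaviors, a WER above 100% and the identical penalty for very different mistakes, are easy to confirm with the word_error_rate sketch from earlier:

print(word_error_rate("go", "please no don't go"))   # 3.0, a WER of 300%
print(word_error_rate("turn left", "turn right"))    # 0.5, serious meaning change
print(word_error_rate("turn left", "turn uh left"))  # 0.5, harmless filler word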
You rarely need to implement the alignment algorithm yourself. Several Python libraries can compute these metrics efficiently. The jiwer library is a popular and easy-to-use option.
To use it, first install it with pip:
pip install jiwer
Then, you can use it to compute WER and get a detailed breakdown of the different error types.
import jiwer
# The reference and hypothesis strings
reference = "show me the weather"
hypothesis = "show the weather now"
# The jiwer.wer function gives the final WER score directly
error_rate = jiwer.wer(reference, hypothesis)
print(f"Word Error Rate: {error_rate:.2%}")
# The compute_measures function returns a dictionary with all counts
measures = jiwer.compute_measures(reference, hypothesis)
print(f"Substitutions: {measures['substitutions']}")
print(f"Deletions: {measures['deletions']}")
print(f"Insertions: {measures['insertions']}")
print(f"Reference words (N): {measures['truth']}")
Running this code produces the result we calculated manually:
Word Error Rate: 50.00%
Substitutions: 0
Deletions: 1
Insertions: 1
Reference words (N): 4
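Note that compute_measures has been deprecated in newer jiwer releases. If you are on jiwer 3.x, the process_words function exposes the same counts as attributes; the snippet below assumes that newer API.

# Assumes jiwer 3.x, where process_words supersedes compute_measures
output = jiwer.process_words(reference, hypothesis)
print(output.wer, output.substitutions, output.deletions, output.insertions)
# visualize_alignment returns a readable word-by-word alignment
print(jiwer.visualize_alignment(output))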
To calculate CER, you can reuse the same functions by treating each character as a token, for example by joining the characters of each string with spaces.
# Calculate CER by working with characters
reference_chars = " ".join(list("hello")) # "h e l l o"
hypothesis_chars = " ".join(list("hallow")) # "h a l l o w"
cer_measures = jiwer.compute_measures(reference_chars, hypothesis_chars)
print(f"\nCharacter Error Rate: {cer_measures['wer']:.2%}")