While the Word Error Rate (WER) formula is straightforward, determining the counts for substitutions (S), deletions (D), and insertions (I) requires a precise method. You cannot simply count word differences; you must find the minimum number of edits required to transform the model's output (the hypothesis) into the correct text (the reference). This is a classic sequence alignment problem, solved using an algorithm that calculates the Levenshtein distance.
The Levenshtein distance measures the difference between two sequences. In our case, the sequences are the words in the reference and hypothesis transcripts. The distance is defined as the minimum number of single-word edits needed to change one sequence into the other. These edits are exactly the three error types we need to calculate WER:

- Substitution (S): a word in the reference is replaced by a different word in the hypothesis.
- Deletion (D): a word in the reference is missing from the hypothesis.
- Insertion (I): a word appears in the hypothesis that is not in the reference.
An algorithm, often based on dynamic programming, finds the optimal alignment between the two word sequences that results in the lowest possible combined count of these three errors. Let's walk through an example to make this clear.
Suppose our reference and hypothesis are:

Reference: the quick brown fox
Hypothesis: the fast brown fox jumped

To find S, D, and I, we align them to minimize the edit distance:

Reference:  the  quick  brown  fox  ***
Hypothesis: the  fast   brown  fox  jumped
The alignment process maps words from the hypothesis to the reference to identify errors. In this case, "quick" is substituted with "fast", and "jumped" is inserted.
From this alignment, we can count the errors:
- Substitutions (S): 1 (quick transcribed as fast).
- Deletions (D): 0.
- Insertions (I): 1 (jumped).

The total number of words in the reference transcript (N) is 4. Now we can calculate the WER:

WER = (S + D + I) / N = (1 + 0 + 1) / 4 = 0.5
This gives us a WER of 50%.
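To make the alignment concrete, here is a minimal word-level Levenshtein sketch in plain Python. This is an illustration of the dynamic-programming idea, not jiwer's actual implementation; the function name is my own.

```python
def word_edit_counts(reference, hypothesis):
    """Count (substitutions, deletions, insertions) needed to turn
    the hypothesis into the reference, via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to align ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # hypothesis empty: all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # reference empty: all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    # Backtrack through the table to classify each edit
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                    # matched word
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s += 1; i, j = i - 1, j - 1            # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1; i -= 1                         # deletion
        else:
            ins += 1; j -= 1                       # insertion
    return s, d, ins

s, d, i = word_edit_counts("the quick brown fox",
                           "the fast brown fox jumped")
print(s, d, i)  # 1 0 1
```

Running this on our example recovers the same counts as the manual alignment: one substitution, zero deletions, one insertion.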
It is important to know that the WER can exceed 1.0, or 100%. This occurs when the total number of errors is greater than the number of words in the reference, which can happen if the model produces an output significantly longer than the reference, leading to a high number of insertions. For instance, if the reference is start recording (N = 2) and the hypothesis is start recording start recording start recording, the WER would be (0S + 0D + 4I) / 2 = 2.0, or 200%.
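As a quick arithmetic check, plugging those counts into the formula in plain Python:

```python
# Reference:  "start recording"  (N = 2)
# Hypothesis: "start recording start recording start recording"
# Optimal alignment: the first "start recording" matches the
# reference; the remaining four words are all insertions.
S, D, I, N = 0, 0, 4, 2
wer = (S + D + I) / N
print(f"WER = {wer:.0%}")  # WER = 200%
```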
Manually implementing the alignment algorithm is unnecessary, as well-established libraries can handle it for you. The jiwer library is a popular and effective tool for this purpose.
First, you will need to install it:
pip install jiwer
Then, you can use its compute_measures function to get a comprehensive breakdown of the errors and the final WER score. This function takes the reference and hypothesis strings as input and returns a dictionary containing all the relevant metrics. (Note that in jiwer 3.0 and later, compute_measures is deprecated in favor of process_words, which exposes the same counts as attributes of a result object.)
import jiwer
# The ground truth text
reference = "the quick brown fox"
# The ASR model's output
hypothesis = "the fast brown fox jumped"
# Calculate all metrics
error_report = jiwer.compute_measures(reference, hypothesis)
# Extract the individual components
wer = error_report['wer']
substitutions = error_report['substitutions']
deletions = error_report['deletions']
insertions = error_report['insertions']
hits = error_report['hits'] # Correctly transcribed words
print(f"Reference: '{reference}'")
print(f"Hypothesis: '{hypothesis}'\n")
print(f"Word Error Rate (WER): {wer:.2%}")
print(f"Substitutions: {substitutions}")
print(f"Deletions: {deletions}")
print(f"Insertions: {insertions}")
print(f"Correct Words (Hits): {hits}")
Running this code will produce the following output, confirming our manual calculation:
Reference: 'the quick brown fox'
Hypothesis: 'the fast brown fox jumped'
Word Error Rate (WER): 50.00%
Substitutions: 1
Deletions: 0
Insertions: 1
Correct Words (Hits): 3
Using a library like jiwer ensures your calculations are consistent, accurate, and follow the standard alignment algorithm. When evaluating your models, you will typically calculate the average WER across an entire test dataset, not just on a single sentence, to get a reliable measure of overall system performance.