While the Word Error Rate (WER) formula is straightforward, determining the counts for substitutions (S), deletions (D), and insertions (I) requires a precise method. You cannot simply count word differences; you must find the minimum number of edits required to transform the model's output (the hypothesis) into the correct text (the reference). This is a classic sequence alignment problem, solved using an algorithm that calculates the Levenshtein distance.
The Levenshtein distance measures the difference between two sequences. In our case, the sequences are the words in the reference and hypothesis transcripts. The distance is defined as the minimum number of single-word edits needed to change one sequence into the other. These edits are exactly the three error types we need to calculate WER:

- Substitution (S): a word in the reference is replaced by a different word in the hypothesis.
- Deletion (D): a word in the reference is missing from the hypothesis.
- Insertion (I): a word appears in the hypothesis that is not in the reference.
An algorithm, often based on dynamic programming, finds the optimal alignment between the two word sequences that results in the lowest possible combined count of these three errors. Let's walk through an example to make this clear.
Suppose our reference and hypothesis are:

Reference: the quick brown fox
Hypothesis: the fast brown fox jumped

To find S, D, and I, we align them to minimize the edit distance:

Reference:  the  quick  brown  fox  ***
Hypothesis: the  fast   brown  fox  jumped
The alignment process maps words from the hypothesis to the reference to identify errors. In this case, "quick" is substituted with "fast", and "jumped" is inserted.
From this alignment, we can count the errors:
- Substitutions (S): 1 (quick transcribed as fast).
- Deletions (D): 0.
- Insertions (I): 1 (jumped).

The total number of words in the reference transcript (N) is 4. Now we can calculate the WER:

WER = (S + D + I) / N = (1 + 0 + 1) / 4 = 0.5
This gives us a WER of 50%.
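To make the alignment concrete, here is a minimal word-level Levenshtein sketch in plain Python. This is an illustration of the dynamic-programming idea, not jiwer's actual implementation; the function name is my own.

```python
def word_edit_counts(reference, hypothesis):
    """Count (substitutions, deletions, insertions) needed to turn
    the hypothesis into the reference, via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to align ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # hypothesis empty: all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # reference empty: all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # match, no edit
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    # Backtrack through the table to classify each edit
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                    # matched word
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            s += 1; i, j = i - 1, j - 1            # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1; i -= 1                         # deletion
        else:
            ins += 1; j -= 1                       # insertion
    return s, d, ins

s, d, i = word_edit_counts("the quick brown fox",
                           "the fast brown fox jumped")
print(s, d, i)  # 1 0 1
```

Running this on our example recovers the same counts as the manual alignment: one substitution, zero deletions, one insertion.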
It is important to know that the WER can exceed 1.0, or 100%. This occurs when the total number of errors is greater than the number of words in the reference, which can happen if the model produces an output significantly longer than the reference, leading to a high number of insertions. For instance, if the reference is start recording (N = 2) and the hypothesis is start recording start recording start recording, the WER would be (0S + 0D + 4I) / 2 = 2.0, or 200%.
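As a quick arithmetic check, plugging those counts into the formula in plain Python:

```python
# Reference:  "start recording"  (N = 2)
# Hypothesis: "start recording start recording start recording"
# Optimal alignment: the first "start recording" matches the
# reference; the remaining four words are all insertions.
S, D, I, N = 0, 0, 4, 2
wer = (S + D + I) / N
print(f"WER = {wer:.0%}")  # WER = 200%
```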
Manually implementing the alignment algorithm is unnecessary, as well-established libraries can handle it for you. The jiwer library is a popular and effective tool for this purpose.
First, you will need to install it:
pip install jiwer
Then, you can use its compute_measures function to get a comprehensive breakdown of the errors and the final WER score. This function takes the reference and hypothesis strings as input and returns a dictionary containing all the relevant metrics. (Note that in jiwer 3.0 and later, compute_measures is deprecated in favor of process_words, which exposes the same counts as attributes of a result object.)
import jiwer
# The ground truth text
reference = "the quick brown fox"
# The ASR model's output
hypothesis = "the fast brown fox jumped"
# Calculate all metrics
error_report = jiwer.compute_measures(reference, hypothesis)
# Extract the individual components
wer = error_report['wer']
substitutions = error_report['substitutions']
deletions = error_report['deletions']
insertions = error_report['insertions']
hits = error_report['hits'] # Correctly transcribed words
print(f"Reference: '{reference}'")
print(f"Hypothesis: '{hypothesis}'\n")
print(f"Word Error Rate (WER): {wer:.2%}")
print(f"Substitutions: {substitutions}")
print(f"Deletions: {deletions}")
print(f"Insertions: {insertions}")
print(f"Correct Words (Hits): {hits}")
Running this code will produce the following output, confirming our manual calculation:
Reference: 'the quick brown fox'
Hypothesis: 'the fast brown fox jumped'
Word Error Rate (WER): 50.00%
Substitutions: 1
Deletions: 0
Insertions: 1
Correct Words (Hits): 3
Using a library like jiwer ensures your calculations are consistent, accurate, and follow the standard alignment algorithm. When evaluating your models, you will typically calculate the average WER across an entire test dataset, not just on a single sentence, to get a reliable measure of overall system performance.