The search for the best word sequence presents a significant computational challenge. Given an audio input, the number of potential sentences is astronomically large. A brute-force approach, where the decoder calculates the probability for every possible sentence, is simply not feasible. For a vocabulary of 50,000 words, even a short 10-word sentence has $50,000^{10}$ potential combinations. We need a much more efficient method to navigate this enormous search space.

This is where the Viterbi algorithm comes in. It is a highly efficient algorithm from a family of techniques known as dynamic programming. The central idea of dynamic programming is to solve a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The Viterbi algorithm applies this principle to find the most probable sequence of states in a model, which in our case corresponds to the most likely sequence of words.

### A Path Through Time

Imagine the decoding process as finding the best path through a grid, often called a trellis or a lattice. Each column in the grid represents a point in time (corresponding to a frame or segment of audio), and each node in a column represents a possible word (or phoneme) that could be spoken at that time.

The algorithm moves through the trellis one time step at a time, from left to right. At each step, it calculates the most probable path to reach every possible word. The critical insight of the Viterbi algorithm is this: to find the best path to a word at the current time step, you only need to know the best paths to all the words at the previous time step. You don't need to remember the entire history of every possible path.

By extending the best paths from the previous step and adding the new costs (based on acoustic and language model probabilities), the algorithm finds the new set of best paths. It then discards all other, less likely paths. This continuous pruning of improbable paths is what makes the algorithm so efficient.

### An Example Walkthrough

Let's illustrate with a simplified example. Suppose, after hearing some audio, the decoder is trying to decide between two partial sentences: "recognize speech" and "wreck a nice beach". The acoustic model might find that both phrases sound similar. The language model, however, knows that "recognize speech" is a much more common phrase than "wreck a nice beach".

The Viterbi algorithm would proceed as follows (a short code sketch of the scoring and pruning step appears after this list):

1. **Time Step 1:** It considers the first part of the audio. The acoustic model gives a high score to both "recognize" and "wreck a nice". The algorithm keeps both as possibilities.
2. **Time Step 2:** It analyzes the next segment of audio, which sounds like "speech". The decoder now evaluates four potential paths:
   - recognize -> speech
   - recognize -> beach
   - wreck a nice -> speech
   - wreck a nice -> beach
3. **Calculation and Pruning:** For each of these four paths, the decoder calculates a total score by combining the acoustic score (how well the audio matches the word) and the language model score (how likely the word sequence is).
   - $P(\text{audio} \mid \text{"recognize speech"}) \times P(\text{"recognize speech"})$ will likely have a high score.
   - $P(\text{audio} \mid \text{"wreck a nice beach"}) \times P(\text{"wreck a nice beach"})$ will also have a reasonably high score.
   - However, paths like recognize -> beach and wreck a nice -> speech will have very low language model probabilities. The algorithm assigns them low scores and effectively prunes them from consideration.
4. **Backtracking:** After reaching the end of the audio, the algorithm identifies the final word of the path with the highest overall probability. From there, it traces backward through the saved pointers at each time step to reconstruct the single best sequence of words.
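To make the scoring concrete, here is a minimal sketch of the two trellis steps from the walkthrough in Python. Every probability below is an invented toy value, not the output of any real acoustic or language model, and the `<s>` sentence-start token and word-level granularity are simplifications chosen to mirror the example; real decoders work in log space over far larger hypothesis sets.

```python
import math

# Toy scores for the walkthrough above. All numbers are invented for
# illustration; a real decoder would obtain them from its acoustic and
# language models and track many more hypotheses per step.
acoustic = {                      # P(audio segment | word), per time step
    1: {"recognize": 0.60, "wreck a nice": 0.55},
    2: {"speech": 0.50, "beach": 0.50},
}
language = {                      # P(word | previous word), bigram-style
    ("<s>", "recognize"): 0.010, ("<s>", "wreck a nice"): 0.002,
    ("recognize", "speech"): 0.200, ("recognize", "beach"): 0.001,
    ("wreck a nice", "speech"): 0.001, ("wreck a nice", "beach"): 0.150,
}

# Time step 1: score the (only) path to each first word.
best = {w: (math.log(acoustic[1][w]) + math.log(language[("<s>", w)]), ["<s>", w])
        for w in acoustic[1]}

# Time step 2: extend every surviving path, but keep only the best path
# into each new word -- this is the Viterbi pruning step.
new_best = {}
for word in acoustic[2]:
    candidates = []
    for prev, (score, path) in best.items():
        total = score + math.log(acoustic[2][word]) + math.log(language[(prev, word)])
        candidates.append((total, path + [word]))
    new_best[word] = max(candidates)   # prune all but the best predecessor

# Termination: pick the highest-scoring path over all final words.
score, path = max(new_best.values())
print(" ".join(path[1:]), f"(log score {score:.2f})")   # -> recognize speech
```

With these toy numbers, the path through "recognize" into "speech" wins, while "beach" survives only via "wreck a nice", exactly as in the walkthrough above.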
The following diagram shows this process. At each time step, the algorithm extends paths from the previous step's survivors. Less likely paths (dashed lines) are discarded, and only the most probable path to each word is retained. The final best path is found by tracing back from the end.

```dot
digraph G {
    rankdir=LR;
    splines=true;
    node [shape=circle, style=filled, fontname="sans-serif", fixedsize=true, width=1.1, height=1.1];
    edge [fontname="sans-serif", fontsize=10];
    graph [fontname="sans-serif", label="Viterbi Path Search", labelloc=t, fontsize=14];

    subgraph cluster_0 {
        label="Start"; style=invis;
        start [label="start", shape=plaintext, fontsize=12];
    }
    subgraph cluster_1 {
        label="Hypothesis at Time 1"; bgcolor="#e9ecef"; style=filled;
        rec [label="recognize", fillcolor="#a5d8ff"];
        wreck [label="wreck a nice", fillcolor="#a5d8ff"];
    }
    subgraph cluster_2 {
        label="Hypothesis at Time 2"; bgcolor="#e9ecef"; style=filled;
        speech [label="speech", fillcolor="#a5d8ff"];
        beach [label="beach", fillcolor="#a5d8ff"];
    }
    subgraph cluster_3 {
        label="End"; style=invis;
        end_node [label="end", shape=plaintext, fontsize=12];
    }

    start -> rec [label=" P(rec|start)"];
    start -> wreck [label=" P(wreck|start)"];
    rec -> speech [label=" P(speech|rec)", penwidth=2.5, color="#fd7e14"];
    rec -> beach [label=" P(beach|rec)", style=dashed, color="#adb5bd"];
    wreck -> speech [label=" P(speech|wreck)", style=dashed, color="#adb5bd"];
    wreck -> beach [label=" P(beach|wreck)", penwidth=2.5, color="#fd7e14"];
    speech -> end_node [penwidth=2.5, color="#fd7e14"];
    beach -> end_node [style=dashed, color="#adb5bd"];

    {rank=same; start;}
    {rank=same; rec; wreck;}
    {rank=same; speech; beach;}
    {rank=same; end_node;}
}
```

A simplified lattice showing the Viterbi search. The algorithm evaluates transitions based on combined acoustic and language model scores. Improbable paths are pruned (dashed gray lines), while the most likely path (orange) is constructed by backtracking from the highest-scoring final state.

### Why It Works

The Viterbi algorithm is guaranteed to find the single best path through the trellis, and it does so without evaluating every possibility. The guarantee comes from the principle of optimality that underlies dynamic programming: the globally best path must itself use the best path into whichever word it passes through at each time step, so keeping only the best path to each word never discards the eventual winner. This transformation of an intractable problem into a manageable one is a foundation of how modern ASR systems produce transcriptions quickly and accurately.
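The same recurrence and backtracking step can be written compactly for a generic hidden-Markov-style model. The sketch below is illustrative only: the `viterbi` function, its argument names, and the dictionary-of-dictionaries model format are assumptions made for this example, not the interface of any particular ASR toolkit.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Find the most probable state sequence for a sequence of observations.

    A generic dynamic-programming sketch over an HMM-style model:
    start_p[s], trans_p[prev][s], and emit_p[s][obs] are ordinary
    probabilities (assumed nonzero); scoring is done in log space.
    """
    # trellis[t][s]: best log score of any path ending in state s at time t
    # backptr[t][s]: predecessor state on that best path (for backtracking)
    trellis = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
                for s in states}]
    backptr = [{s: None for s in states}]

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Keep only the best path into state s; all others are pruned.
            prev = max(states,
                       key=lambda p: trellis[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (trellis[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        trellis.append(scores)
        backptr.append(pointers)

    # Backtracking: start at the best-scoring final state and follow pointers.
    last = max(states, key=lambda s: trellis[-1][s])
    best_score = trellis[-1][last]
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), best_score
```

In a full recognizer the states would typically be phone-level units rather than whole words, the emission scores would come from the acoustic model, the transition scores would fold in the language model, and beam pruning would further limit how many states are kept per frame; the core recurrence and the backtracking, however, are the same idea shown here.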