The decoder’s job is to find the single word sequence $W$ that best matches the input audio $O$. It does this by combining scores from the acoustic model, $P(O|W)$, and the language model, $P(W)$, to find $W^* = \arg\max_W P(O|W)\,P(W)$. The challenge is that the number of possible word sequences is unimaginably large: a ten-second audio clip could correspond to millions, if not billions, of potential sentences.

Checking every single possibility, a method known as a "brute-force search," is computationally infeasible. With a vocabulary of 20,000 words, the number of possible 5-word sentences is $20{,}000^5 \approx 3.2 \times 10^{21}$, a number so large that even the fastest computers could not check them all in a reasonable amount of time. The decoder needs a much more efficient strategy.

## The Search Problem as a Pathfinding Task

A better way to approach this is to reframe it as a pathfinding problem. Imagine a giant graph where each node represents a possible word that could follow the previous one. The decoder's task is to find the "cheapest," or most likely, path from a start point to an end point through this graph.

Each path through the graph represents a unique sentence, or "hypothesis." Each step along a path has an associated cost, which is calculated from the acoustic and language model probabilities.
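To make the combined score concrete, here is a minimal sketch of scoring two competing hypotheses. All numbers are illustrative, made-up values; in a real decoder the log-probabilities come from the acoustic and language models themselves.

```python
# Illustrative (made-up) log-probabilities for two competing hypotheses.
hypotheses = {
    "recognize speech":   {"log_p_acoustic": -12.4, "log_p_lm": -5.1},
    "wreck a nice beach": {"log_p_acoustic": -11.9, "log_p_lm": -9.8},
}

def combined_score(h):
    # Working in log space turns the product P(O|W) * P(W) into a sum,
    # which avoids numerical underflow for long sentences.
    return h["log_p_acoustic"] + h["log_p_lm"]

best = max(hypotheses, key=lambda w: combined_score(hypotheses[w]))
print(best)  # → recognize speech

# The brute-force alternative is hopeless: with a 20,000-word vocabulary,
# the number of 5-word sentences alone is
print(f"{20_000 ** 5:.1e}")  # → 3.2e+21
```

Working in log space is standard practice in decoders, since multiplying thousands of small probabilities would underflow floating-point arithmetic.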
A path with a low acoustic model score (meaning the sounds don't match well) or a low language model score (meaning the word sequence is unlikely) will become very "expensive" quickly.

```dot
digraph G {
  rankdir=LR;
  node [shape=circle, style=filled, fillcolor="#a5d8ff", fontname="sans-serif", color="#4263eb"];
  edge [fontname="sans-serif", color="#868e96"];
  splines=true;
  START [label="Start", shape=doublecircle, fillcolor="#b2f2bb"];
  END [label="End", shape=doublecircle, fillcolor="#b2f2bb"];
  subgraph cluster_0 { style=invis; START -> R [label=" 'rec' sound "]; R [label="rec..."]; }
  subgraph cluster_1 { style=invis; R -> OGN; R -> K; OGN [label="...ognize"]; K [label="...k"]; }
  subgraph cluster_2 { style=invis; OGN -> SPEECH; K -> A; SPEECH [label="speech"]; A [label="a"]; }
  subgraph cluster_3 { style=invis; A -> NICE; NICE [label="nice"]; }
  subgraph cluster_4 { style=invis; NICE -> BEACH; BEACH [label="beach"]; }
  subgraph cluster_5 { style=invis; SPEECH -> END; BEACH -> END; }
  // Highlight the "winning" path
  R -> OGN [color="#f03e3e", penwidth=2.0];
  OGN -> SPEECH [color="#f03e3e", penwidth=2.0];
  SPEECH -> END [color="#f03e3e", penwidth=2.0];
}
```

*A simplified search graph. The decoder checks different paths (hypotheses), such as "recognize speech" and "wreck a nice beach." The red path indicates the one with the best combined score from the acoustic and language models.*

## Pruning Unlikely Hypotheses

Instead of blindly exploring all paths, search algorithms use an intelligent technique called pruning. As the decoder builds up sentence hypotheses word by word, it keeps track of their scores. If one path becomes significantly less likely than the best-known path so far, the algorithm simply abandons it.

This is a form of "beam search," where the decoder only keeps a small number, or "beam," of the most promising hypotheses at each step and discards the rest.
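The pruning idea can be sketched in a few lines. This is a toy beam search over the graph from the figure above; the word scores and the `successors` table are invented for illustration, not output from any real model.

```python
import heapq

# Toy search graph: each word maps to possible next words with a
# made-up log-score combining acoustic and language model evidence.
successors = {
    "<s>":       {"recognize": -1.0, "wreck": -1.2},
    "recognize": {"speech": -0.8},
    "wreck":     {"a": -1.5},
    "a":         {"nice": -1.0},
    "nice":      {"beach": -1.0},
    "speech":    {"</s>": -0.2},
    "beach":     {"</s>": -0.2},
}

def beam_search(successors, beam_width=2, max_len=6):
    beam = [(0.0, ["<s>"])]          # (total log-score, path so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, path in beam:
            for word, s in successors.get(path[-1], {}).items():
                new = (score + s, path + [word])
                (finished if word == "</s>" else candidates).append(new)
        # Pruning: keep only the beam_width most promising hypotheses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if not beam:
            break
    return max(finished, key=lambda c: c[0])

# With a beam of 1, the "wreck" branch is pruned at the very first step.
score, path = beam_search(successors, beam_width=1)
print(" ".join(path[1:-1]))  # → recognize speech
```

With such a tiny graph the beam barely matters, but with a real vocabulary each step branches thousands of ways, and the beam is what keeps the candidate set manageable.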
By pruning away the majority of bad paths early, the decoder can focus its computational effort on the hypotheses that actually have a chance of being correct.

This process is what makes modern speech recognition possible. Without efficient search algorithms that can navigate a massive space of possibilities, even the best acoustic and language models would be useless. In the next section, you'll learn about the Viterbi algorithm, a classic and effective algorithm that forms the basis for many decoders.