A fundamental challenge in speech recognition is that the length of the input audio features does not match the length of the output text transcription. For instance, with a typical 10-millisecond frame shift, an 800-millisecond audio clip produces around 80 feature vectors, while the corresponding transcription could be a short phrase like "OK Google", only a handful of characters long. Furthermore, there is no explicit, frame-by-frame alignment telling the model which part of the audio corresponds to which character. A standard cross-entropy loss function, which requires a one-to-one mapping between inputs and targets, is not suitable for this task.
This is the problem that Connectionist Temporal Classification (CTC) loss is designed to solve. It is a loss function that allows a neural network to be trained on sequence-to-sequence tasks where the alignment between the input and output is unknown.
Consider an acoustic model, such as an LSTM, that processes a sequence of audio features. For each feature vector at each time step t, the network outputs a probability distribution over all possible characters in our vocabulary. If our input has T time steps, we will get T probability distributions.
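To make this concrete, here is a minimal sketch of such a model in PyTorch. The feature dimension, hidden size, and 27-character vocabulary are illustrative assumptions; the only point is the shape of the output, one probability distribution per input frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of an acoustic model. The sizes are illustrative assumptions:
# 80-dimensional filterbank features and a 27-character vocabulary (26 letters + space).
class TinyAcousticModel(nn.Module):
    def __init__(self, feature_dim=80, hidden_dim=256, num_chars=27):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_chars)

    def forward(self, features):                         # features: (batch, T, feature_dim)
        hidden, _ = self.lstm(features)                   # one hidden vector per time step
        return F.log_softmax(self.proj(hidden), dim=-1)   # one distribution per time step

model = TinyAcousticModel()
features = torch.randn(1, 80, 80)   # one utterance with T = 80 feature frames
log_probs = model(features)
print(log_probs.shape)              # torch.Size([1, 80, 27]): T distributions over the vocabulary
```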
Now, imagine our target transcription is the word "CAT". It has a length N=3. The input feature sequence has a length T, which is almost certainly much larger than 3. How do we calculate the loss? Which of the T output steps should correspond to "C", which to "A", and which to "T"? What about the time steps that correspond to the silence between words or the elongated sound of a vowel?
CTC cleverly resolves this by augmenting the vocabulary with a special blank token, often represented as <B> or ε. The model is now allowed to predict this blank token at any time step.
With the blank token, the network's output sequence now has the same length as the input feature sequence, T. We can then define a simple set of rules to transform this longer output sequence (called a "path") into the final, shorter transcription.
The collapsing process works as follows:
1. Merge repeated characters into a single character: C-C-A-A-A-T becomes C-A-T.
2. Remove all blank tokens: applied on its own, C-<B>-A-A-<B>-T becomes C-A-A-T.

Let's combine these rules, merging repeats first and then removing blanks. If the network outputs the path <B>-C-C-<B>-A-T-T, the transformation would be:

<B>-C-C-<B>-A-T-T -> <B>-C-<B>-A-T (merge repeats) -> C-A-T (remove blanks)

The blank token is significant because it separates characters that should remain distinct even though they repeat. For instance, in the word "HELLO", the two "L"s must be separated by a blank in the network's path to avoid being merged into a single "L". A valid path could be H-E-L-<B>-L-O, whereas a path like H-E-L-L-O would incorrectly collapse to "HELO".
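These two rules are straightforward to implement. The short sketch below applies them in order and reproduces the examples above; the token strings, including the <B> marker, are just an illustrative representation of a network output path.

```python
import itertools

BLANK = "<B>"

def collapse(path):
    """Apply the CTC collapsing rules: merge repeated tokens, then remove blanks."""
    merged = [token for token, _ in itertools.groupby(path)]   # step 1: merge repeats
    return [token for token in merged if token != BLANK]       # step 2: remove blanks

print(collapse(["C", "C", "A", "A", "A", "T"]))            # ['C', 'A', 'T']
print(collapse([BLANK, "C", "C", BLANK, "A", "T", "T"]))   # ['C', 'A', 'T']
print(collapse(["H", "E", "L", BLANK, "L", "O"]))          # ['H', 'E', 'L', 'L', 'O']
print(collapse(["H", "E", "L", "L", "O"]))                 # ['H', 'E', 'L', 'O'] -- the "HELO" problem
```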
This many-to-one mapping means numerous paths can result in the same final transcription. The following diagram illustrates how different paths from the network's output layer can all collapse to the target "CAT".
Different network output sequences (paths) that correctly decode to the target transcription "CAT" after applying the CTC collapsing rules.
The core idea of the CTC loss function is to sum the probabilities of all possible paths that correctly map to the target transcription. Let's denote the target transcription as $Y$ and the input feature sequence as $X$.
Path Probability: At each time step $t$, the network (e.g., an LSTM) outputs a probability distribution $p_t(k)$ for each token $k$ in the vocabulary (characters plus the blank token). The probability of a single path $\pi$ of length $T$ is the product of the probabilities of the tokens at each time step:

$$P(\pi \mid X) = \prod_{t=1}^{T} p_t(\pi_t)$$

where $\pi_t$ is the token in path $\pi$ at time step $t$.
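As a quick worked example with made-up numbers, suppose $T = 3$ and the network assigns $p_1(\text{C}) = 0.6$, $p_2(\text{<B>}) = 0.5$, and $p_3(\text{A}) = 0.7$. The path $\pi = \text{C-<B>-A}$, which collapses to "CA", then has probability

$$P(\pi \mid X) = p_1(\text{C}) \cdot p_2(\text{<B>}) \cdot p_3(\text{A}) = 0.6 \times 0.5 \times 0.7 = 0.21.$$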
Total Target Probability: The total probability of the target transcription $Y$ is the sum of the probabilities of all paths $\pi$ that can be collapsed into $Y$:

$$P(Y \mid X) = \sum_{\pi \in \text{valid paths}} P(\pi \mid X)$$

The Loss Function: The CTC loss is the negative log-likelihood of this total probability. By minimizing this loss, we maximize the probability that the network will output a path that decodes to the correct text:
$$\mathcal{L}_{\text{CTC}} = -\log P(Y \mid X)$$

Calculating this sum over all possible paths seems computationally intractable, since the number of paths grows exponentially with the sequence length. However, the summation can be computed efficiently using a dynamic programming approach known as the forward-backward algorithm, similar to the algorithms used in Hidden Markov Models (HMMs). Fortunately, deep learning frameworks like TensorFlow and PyTorch provide optimized implementations, so you can call the loss function directly without worrying about the underlying complexity.
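To see that the built-in loss really computes this sum, we can check it against a brute-force enumeration on a toy problem. The sketch below uses torch.nn.CTCLoss; the tiny three-token vocabulary, the random per-frame distributions, and the target "CA" are illustrative assumptions, and enumerating every path is only feasible because $T$ and the vocabulary are so small.

```python
import itertools
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, C = 4, 3          # 4 time steps, 3 classes: 0 = blank, 1 = "C", 2 = "A"
target = [1, 2]      # target transcription "CA" as class indices

# Random per-frame log-probabilities, shape (T, batch=1, C)
log_probs = F.log_softmax(torch.randn(T, 1, C), dim=-1)

def collapse(path, blank=0):
    # CTC collapsing rules: merge repeated tokens, then remove blanks.
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != blank]

# Brute force: sum the probability of every length-T path that collapses to the target.
probs = log_probs[:, 0, :].exp()          # shape (T, C)
total = 0.0
for path in itertools.product(range(C), repeat=T):
    if collapse(path) == target:
        p = 1.0
        for t, k in enumerate(path):
            p *= probs[t, k].item()
        total += p
brute_force_loss = -torch.log(torch.tensor(total))

# Built-in CTC loss (reduction='sum' gives -log P(Y|X) for a batch of one).
ctc = torch.nn.CTCLoss(blank=0, reduction='sum')
builtin_loss = ctc(log_probs,
                   torch.tensor([target]),          # targets, shape (1, S)
                   torch.tensor([T]),               # input lengths
                   torch.tensor([len(target)]))     # target lengths

print(brute_force_loss.item(), builtin_loss.item())  # the two values should agree (up to float error)
```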
Once a model is trained with CTC loss, we can use it for inference to convert new audio into text. The network produces a matrix of probabilities of size $T \times (\text{num\_characters} + 1)$. The process of generating text from this matrix is called decoding.
The simplest decoding method is greedy decoding (also known as best path decoding). At each time step t, we simply select the token with the highest probability. This gives us a single, most likely path. We then collapse this path using the CTC rules.
For example, if the argmax at each time step yields the path H-H-<B>-E-L-L-<B>-L-O-O, greedy decoding would collapse this to:
H-H-<B>-E-L-L-<B>-L-O-O -> H-<B>-E-L-<B>-L-O (merge repeats)
H-<B>-E-L-<B>-L-O -> HELLO (remove blanks)

While fast, greedy decoding is not optimal because the most probable single path might not correspond to the most probable transcription: the summed probability of many "good" paths for one transcription can exceed the probability of the single "best" path. More sophisticated algorithms like beam search, which we will discuss in Chapter 5, can achieve better results by exploring a larger set of high-probability paths during decoding.
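A greedy decoder takes only a few lines. The sketch below assumes a hypothetical five-token vocabulary and builds a toy score matrix whose per-frame argmax reproduces the path above; a real decoder would operate on the network's actual output matrix.

```python
import itertools
import torch

# Hypothetical vocabulary for this example: index 0 is the blank token.
vocab = ["<B>", "H", "E", "L", "O"]

def greedy_decode(log_probs, vocab, blank=0):
    """Best-path decoding: take the argmax at every frame, then apply the CTC collapsing rules."""
    best_path = log_probs.argmax(dim=-1).tolist()            # most likely token per time step
    merged = [k for k, _ in itertools.groupby(best_path)]    # merge repeated tokens
    return "".join(vocab[k] for k in merged if k != blank)   # remove blanks, map to characters

# Toy (T=10, C=5) score matrix whose per-frame argmax spells H-H-<B>-E-L-L-<B>-L-O-O.
# The values are not normalized log-probabilities, but only the argmax matters here.
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = torch.full((10, 5), -10.0)
scores[torch.arange(10), torch.tensor(path)] = 0.0

print(greedy_decode(scores, vocab))  # HELLO
```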