A fundamental challenge in speech recognition is that the length of the input audio features does not match the length of the output text transcription. For instance, with a typical 10-millisecond frame shift, an 800-millisecond audio clip produces around 80 feature vectors, while the corresponding transcription could be a short phrase like "OK Google", only a handful of characters long. Furthermore, there is no explicit, frame-by-frame alignment telling the model which part of the audio corresponds to which character. A standard cross-entropy loss function, which requires a one-to-one mapping between inputs and targets, is not suitable for this task.
This is the problem that Connectionist Temporal Classification (CTC) loss is designed to solve. It is a loss function that allows a neural network to be trained on sequence-to-sequence tasks where the alignment between the input and output is unknown.
Consider an acoustic model, such as an LSTM, that processes a sequence of audio features. For each feature vector at each time step t, the network outputs a probability distribution over all possible characters in our vocabulary. If our input has T time steps, we will get T probability distributions.
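To make this concrete, here is a minimal sketch of such a model in PyTorch. The feature dimension, hidden size, and 27-character vocabulary are illustrative assumptions; the only point is the shape of the output, one probability distribution per input frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of an acoustic model. The sizes are illustrative assumptions:
# 80-dimensional filterbank features and a 27-character vocabulary (26 letters + space).
class TinyAcousticModel(nn.Module):
    def __init__(self, feature_dim=80, hidden_dim=256, num_chars=27):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_chars)

    def forward(self, features):                         # features: (batch, T, feature_dim)
        hidden, _ = self.lstm(features)                   # one hidden vector per time step
        return F.log_softmax(self.proj(hidden), dim=-1)   # one distribution per time step

model = TinyAcousticModel()
features = torch.randn(1, 80, 80)   # one utterance with T = 80 feature frames
log_probs = model(features)
print(log_probs.shape)              # torch.Size([1, 80, 27]): T distributions over the vocabulary
```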
Now, imagine our target transcription is the word "CAT". It has a length N=3. The input feature sequence has a length T, which is almost certainly much larger than 3. How do we calculate the loss? Which of the T output steps should correspond to "C", which to "A", and which to "T"? What about the time steps that correspond to the silence between words or the elongated sound of a vowel?
CTC cleverly resolves this by augmenting the vocabulary with a special blank token, often represented as <B> or ε. The model is now allowed to predict this blank token at any time step.
With the blank token, the network's output sequence now has the same length as the input feature sequence, T. We can then define a simple set of rules to transform this longer output sequence (called a "path") into the final, shorter transcription.
The collapsing process works as follows:
1. Merge repeated characters into a single character: C-C-A-A-A-T becomes C-A-T.
2. Remove all blank tokens: applied on its own, C-<B>-A-A-<B>-T becomes C-A-A-T.

Let's combine these rules, merging repeats first and then removing blanks. If the network outputs the path <B>-C-C-<B>-A-T-T, the transformation would be:

<B>-C-C-<B>-A-T-T -> <B>-C-<B>-A-T (merge repeats) -> C-A-T (remove blanks)

The blank token is significant because it separates characters that should remain distinct even though they repeat. For instance, in the word "HELLO", the two "L"s must be separated by a blank in the network's path to avoid being merged into a single "L". A valid path could be H-E-L-<B>-L-O, whereas a path like H-E-L-L-O would incorrectly collapse to "HELO".
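These two rules are straightforward to implement. The short sketch below applies them in order and reproduces the examples above; the token strings, including the <B> marker, are just an illustrative representation of a network output path.

```python
import itertools

BLANK = "<B>"

def collapse(path):
    """Apply the CTC collapsing rules: merge repeated tokens, then remove blanks."""
    merged = [token for token, _ in itertools.groupby(path)]   # step 1: merge repeats
    return [token for token in merged if token != BLANK]       # step 2: remove blanks

print(collapse(["C", "C", "A", "A", "A", "T"]))            # ['C', 'A', 'T']
print(collapse([BLANK, "C", "C", BLANK, "A", "T", "T"]))   # ['C', 'A', 'T']
print(collapse(["H", "E", "L", BLANK, "L", "O"]))          # ['H', 'E', 'L', 'L', 'O']
print(collapse(["H", "E", "L", "L", "O"]))                 # ['H', 'E', 'L', 'O'] -- the "HELO" problem
```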
This many-to-one mapping means numerous paths can result in the same final transcription. The following diagram illustrates how different paths from the network's output layer can all collapse to the target "CAT".
Different network output sequences (paths) that correctly decode to the target transcription "CAT" after applying the CTC collapsing rules.
The core idea of the CTC loss function is to sum the probabilities of all possible paths that correctly map to the target transcription. Let's denote the target transcription as $Y$ and the input feature sequence as $X$.
Path Probability: At each time step $t$, the network (e.g., an LSTM) outputs a probability distribution $p_t(k)$ for each token $k$ in the vocabulary (characters plus the blank token). The probability of a single path $\pi$ of length $T$ is the product of the probabilities of the tokens at each time step:

$$P(\pi \mid X) = \prod_{t=1}^{T} p_t(\pi_t)$$

where $\pi_t$ is the token in path $\pi$ at time step $t$.
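As a quick worked example with made-up numbers, suppose $T = 3$ and the network assigns $p_1(\text{C}) = 0.6$, $p_2(\text{<B>}) = 0.5$, and $p_3(\text{A}) = 0.7$. The path $\pi = \text{C-<B>-A}$, which collapses to "CA", then has probability

$$P(\pi \mid X) = p_1(\text{C}) \cdot p_2(\text{<B>}) \cdot p_3(\text{A}) = 0.6 \times 0.5 \times 0.7 = 0.21.$$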
Total Target Probability: The total probability of the target transcription $Y$ is the sum of the probabilities of all paths $\pi$ that can be collapsed into $Y$:

$$P(Y \mid X) = \sum_{\pi \in \text{valid paths}} P(\pi \mid X)$$

The Loss Function: The CTC loss is the negative log-likelihood of this total probability. By minimizing this loss, we maximize the probability that the network will output a path that decodes to the correct text:
$$\mathcal{L}_{\text{CTC}} = -\log P(Y \mid X)$$

Calculating this sum over all possible paths seems computationally intractable, since the number of paths grows exponentially with the sequence length. However, the summation can be computed efficiently using a dynamic programming approach known as the forward-backward algorithm, similar to the algorithms used in Hidden Markov Models (HMMs). Fortunately, deep learning frameworks like TensorFlow and PyTorch provide optimized implementations, so you can call the loss function directly without worrying about the underlying complexity.
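To see that the built-in loss really computes this sum, we can check it against a brute-force enumeration on a toy problem. The sketch below uses torch.nn.CTCLoss; the tiny three-token vocabulary, the random per-frame distributions, and the target "CA" are illustrative assumptions, and enumerating every path is only feasible because $T$ and the vocabulary are so small.

```python
import itertools
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, C = 4, 3          # 4 time steps, 3 classes: 0 = blank, 1 = "C", 2 = "A"
target = [1, 2]      # target transcription "CA" as class indices

# Random per-frame log-probabilities, shape (T, batch=1, C)
log_probs = F.log_softmax(torch.randn(T, 1, C), dim=-1)

def collapse(path, blank=0):
    # CTC collapsing rules: merge repeated tokens, then remove blanks.
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != blank]

# Brute force: sum the probability of every length-T path that collapses to the target.
probs = log_probs[:, 0, :].exp()          # shape (T, C)
total = 0.0
for path in itertools.product(range(C), repeat=T):
    if collapse(path) == target:
        p = 1.0
        for t, k in enumerate(path):
            p *= probs[t, k].item()
        total += p
brute_force_loss = -torch.log(torch.tensor(total))

# Built-in CTC loss (reduction='sum' gives -log P(Y|X) for a batch of one).
ctc = torch.nn.CTCLoss(blank=0, reduction='sum')
builtin_loss = ctc(log_probs,
                   torch.tensor([target]),          # targets, shape (1, S)
                   torch.tensor([T]),               # input lengths
                   torch.tensor([len(target)]))     # target lengths

print(brute_force_loss.item(), builtin_loss.item())  # the two values should agree (up to float error)
```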
Once a model is trained with CTC loss, we can use it for inference to convert new audio into text. The network produces a matrix of probabilities of size $T \times (\text{num\_characters} + 1)$. The process of generating text from this matrix is called decoding.
The simplest decoding method is greedy decoding (also known as best path decoding). At each time step t, we simply select the token with the highest probability. This gives us a single, most likely path. We then collapse this path using the CTC rules.
For example, if the argmax at each time step yields the path H-H-<B>-E-L-L-<B>-L-O-O, greedy decoding would collapse this to:
H-H-<B>-E-L-L-<B>-L-O-O -> H-<B>-E-L-<B>-L-O (merge repeats)
H-<B>-E-L-<B>-L-O -> HELLO (remove blanks)

While fast, greedy decoding is not optimal because the most probable single path might not correspond to the most probable transcription: the summed probability of many "good" paths for one transcription can exceed the probability of the single "best" path. More sophisticated algorithms like beam search, which we will discuss in Chapter 5, can achieve better results by exploring a larger set of high-probability paths during decoding.
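A greedy decoder takes only a few lines. The sketch below assumes a hypothetical five-token vocabulary and builds a toy score matrix whose per-frame argmax reproduces the path above; a real decoder would operate on the network's actual output matrix.

```python
import itertools
import torch

# Hypothetical vocabulary for this example: index 0 is the blank token.
vocab = ["<B>", "H", "E", "L", "O"]

def greedy_decode(log_probs, vocab, blank=0):
    """Best-path decoding: take the argmax at every frame, then apply the CTC collapsing rules."""
    best_path = log_probs.argmax(dim=-1).tolist()            # most likely token per time step
    merged = [k for k, _ in itertools.groupby(best_path)]    # merge repeated tokens
    return "".join(vocab[k] for k in merged if k != blank)   # remove blanks, map to characters

# Toy (T=10, C=5) score matrix whose per-frame argmax spells H-H-<B>-E-L-L-<B>-L-O-O.
# The values are not normalized log-probabilities, but only the argmax matters here.
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = torch.full((10, 5), -10.0)
scores[torch.arange(10), torch.tensor(path)] = 0.0

print(greedy_decode(scores, vocab))  # HELLO
```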