Traditional hybrid Hidden Markov Model-Deep Neural Network (HMM-DNN) systems, while effective, involve distinct components (acoustic model, lexicon, language model) and rely on pre-computed alignments between audio frames and phonetic states for training. This alignment process itself can be complex and potentially suboptimal. Connectionist Temporal Classification (CTC) offers an alternative approach, enabling end-to-end training of acoustic models directly from input audio features to output character or word sequences, without needing frame-level alignments.
Imagine trying to map a variable-length sequence of audio input frames $x = (x_1, x_2, \dots, x_T)$ to a much shorter, variable-length target label sequence $y = (y_1, y_2, \dots, y_U)$, where $T$ (the number of feature frames) is typically much larger than $U$ (the number of output characters). For example, a 1-second utterance might have 100 feature vectors but correspond to only 10 characters. The core problem is determining which input frames correspond to which output labels. CTC elegantly sidesteps this explicit alignment requirement.
CTC introduces a special blank symbol, often denoted as $\epsilon$ or $\langle b \rangle$. This symbol represents "no output label" at a particular time step. The acoustic model, typically an RNN or Transformer, processes the input sequence $x$ and outputs a probability distribution $p(k \mid t, x)$ at each time step $t$ over the vocabulary of possible output labels plus the blank symbol. Let the extended vocabulary be $V' = V \cup \{b\}$. The output of the network is a sequence of probability distributions $p_1, p_2, \dots, p_T$, where each $p_t$ is a vector of size $|V'|$.
Consider a sequence of output predictions from the network across all time steps, $\pi = (\pi_1, \pi_2, \dots, \pi_T)$, where each $\pi_t \in V'$. This is called a CTC path. The probability of a specific path $\pi$ is calculated assuming conditional independence between predictions at different time steps given the input:
$$P(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t \mid t, x)$$
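Under this independence assumption, a path's probability is just a product of per-frame probabilities. A minimal sketch, using a hypothetical hand-made distribution over $V' = \{\text{a, c, t}, b\}$ for $T = 3$ frames:

```python
def path_probability(path, probs):
    """Probability of one CTC path: the product of the per-frame label
    probabilities, using the conditional independence assumption."""
    # probs[t] maps each symbol in V' to p(symbol | t, x)
    p = 1.0
    for t, label in enumerate(path):
        p *= probs[t][label]
    return p

# Hypothetical network outputs for T = 3 frames over V' = {a, c, t, b}.
probs = [
    {'a': 0.1, 'c': 0.6, 't': 0.1, 'b': 0.2},
    {'a': 0.7, 'c': 0.1, 't': 0.1, 'b': 0.1},
    {'a': 0.1, 'c': 0.1, 't': 0.6, 'b': 0.2},
]
print(path_probability(('c', 'a', 't'), probs))  # 0.6 * 0.7 * 0.6 = 0.252
```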
Now, the critical step is defining a many-to-one mapping $B$ from the set of all possible paths $\pi$ to target label sequences $y$. The mapping $B$ first merges all runs of consecutive duplicate labels in the path into a single label and then removes all blank symbols.
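The mapping $B$ is straightforward to implement. A sketch, with the blank written as the character 'b':

```python
def ctc_collapse(path, blank='b'):
    """The CTC mapping B: merge consecutive duplicates, then drop blanks."""
    collapsed = []
    prev = None
    for label in path:
        if label != prev:          # step 1: keep only the first of each run
            collapsed.append(label)
        prev = label
    return tuple(l for l in collapsed if l != blank)  # step 2: strip blanks

print(ctc_collapse(('c', 'c', 'b', 'a', 'a', 't')))       # ('c', 'a', 't')
print(ctc_collapse(('b', 'c', 'a', 'b', 'a', 't', 'b')))  # ('c', 'a', 'a', 't')
```

The second call shows why the blank is essential: a genuinely repeated label such as the double "a" survives collapsing only when a blank separates the two occurrences in the path.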
For example, if our target label sequence is $y = (\text{c, a, t})$ and the extended vocabulary is $\{\text{a, c, t}, b\}$, several paths of length $T = 5$ could map to $y$, such as $(\text{c, c, a, t}, b)$, $(b, \text{c, a, a, t})$, and $(\text{c}, b, \text{a, t, t})$.
The probability of observing the target sequence $y$ given the input $x$ is the sum of the probabilities of all paths $\pi$ that map to $y$:

$$P(y \mid x) = \sum_{\pi \in B^{-1}(y)} P(\pi \mid x)$$

where $B^{-1}(y)$ denotes the set of all paths that collapse to the target sequence $y$.
Directly calculating P(y∣x) by summing over all valid paths is computationally infeasible because the number of such paths grows exponentially with the input length T. CTC employs a dynamic programming algorithm, similar to the forward-backward algorithm used in HMMs, to compute this sum efficiently.
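To see why the naive sum is only feasible for toy problems, here is a brute-force version that enumerates all $|V'|^T$ paths; with a hypothetical hand-made distribution and $T = 3$ this is only $4^3 = 64$ paths, but at $T = 100$ it would be $4^{100}$:

```python
import itertools

def collapse(path, blank='b'):
    # The mapping B: merge consecutive duplicates, then drop blanks.
    out, prev = [], None
    for s in path:
        if s != prev:
            out.append(s)
        prev = s
    return tuple(s for s in out if s != blank)

def total_probability_brute(target, probs, vocab=('a', 'c', 't', 'b')):
    """Sum P(pi | x) over every path pi with B(pi) == target.
    Enumerates |V'| ** T paths, so it is only feasible for tiny T."""
    total = 0.0
    for path in itertools.product(vocab, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# Hypothetical network outputs for T = 3 frames over V' = {a, c, t, b}.
probs = [
    {'a': 0.1, 'c': 0.6, 't': 0.1, 'b': 0.2},
    {'a': 0.7, 'c': 0.1, 't': 0.1, 'b': 0.1},
    {'a': 0.1, 'c': 0.1, 't': 0.6, 'b': 0.2},
]
# Exactly five length-3 paths collapse to (c, t): ('c','c','t'),
# ('c','t','t'), ('c','b','t'), ('b','c','t'), and ('c','t','b').
print(total_probability_brute(('c', 't'), probs))  # 0.132
```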
Let's define $y'$ as the target sequence $y$ interleaved with blank symbols at the beginning, at the end, and between every pair of labels. For $y = (\text{c, a, t})$, $y' = (b, \text{c}, b, \text{a}, b, \text{t}, b)$. Its length is $|y'| = 2U + 1$.
The core idea is to compute forward probabilities $\alpha(t, u)$, representing the total probability of all paths ending in state $u$ of $y'$ (either a blank or a character) at time step $t$. Similarly, backward probabilities $\beta(t, u)$ represent the total probability of completing the sequence $y$ starting from state $u$ at time $t$.
The recursive computation for $\alpha(t, u)$ sums probabilities from the valid predecessor states at time $t-1$:

$$\alpha(t, u) = p(y'_u \mid t, x) \cdot \begin{cases} \alpha(t-1, u) + \alpha(t-1, u-1) & \text{if } y'_u = b \text{ or } y'_u = y'_{u-2} \\ \alpha(t-1, u) + \alpha(t-1, u-1) + \alpha(t-1, u-2) & \text{otherwise} \end{cases}$$

The transition from state $u-2$ is excluded in the first case because it would either skip over a label (when $y'_u$ is a blank) or incorrectly merge two identical consecutive labels (when $y'_u = y'_{u-2}$). The base cases are $\alpha(1, 1) = p(y'_1 \mid 1, x)$ and $\alpha(1, 2) = p(y'_2 \mid 1, x)$, with $\alpha(1, u) = 0$ for $u > 2$.
The total probability P(y∣x) can be efficiently calculated by summing the forward probabilities at the final time step T for the last two states of y′ (the final label and the final blank):
$$P(y \mid x) = \alpha(T, |y'|) + \alpha(T, |y'| - 1)$$
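The forward recursion can be implemented directly. The sketch below assumes a hypothetical dictionary-per-frame representation of the network outputs; it is illustrative rather than production code, since real implementations work in log space for numerical stability:

```python
def ctc_forward(target, probs, blank='b'):
    """Compute P(y | x) with the CTC forward recursion over the
    blank-augmented sequence y' (blanks at the start, at the end,
    and between every pair of labels)."""
    y_ext = [blank]
    for label in target:
        y_ext += [label, blank]           # y' has length 2U + 1
    S, T = len(y_ext), len(probs)

    # alpha[t][u]: total probability of all paths that are in state u
    # of y' after emitting frame t.
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][y_ext[0]]      # start in the initial blank...
    alpha[0][1] = probs[0][y_ext[1]]      # ...or in the first real label
    for t in range(1, T):
        for u in range(S):
            a = alpha[t - 1][u]                      # stay in the same state
            if u >= 1:
                a += alpha[t - 1][u - 1]             # advance one state
            # Skipping a state is allowed only when it would neither skip
            # a label nor merge two identical consecutive labels.
            if u >= 2 and y_ext[u] != blank and y_ext[u] != y_ext[u - 2]:
                a += alpha[t - 1][u - 2]
            alpha[t][u] = a * probs[t][y_ext[u]]

    # Valid endings: the final blank or the final label of y'.
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]

# Hypothetical network outputs for T = 3 frames over V' = {a, c, t, b}.
probs = [
    {'a': 0.1, 'c': 0.6, 't': 0.1, 'b': 0.2},
    {'a': 0.7, 'c': 0.1, 't': 0.1, 'b': 0.1},
    {'a': 0.1, 'c': 0.1, 't': 0.6, 'b': 0.2},
]
print(ctc_forward(('c', 'a', 't'), probs))  # 0.252: only path is (c, a, t)
print(ctc_forward(('c', 't'), probs))       # 0.132: five paths collapse to (c, t)
```

The cost is $O(T \cdot |y'|)$ rather than exponential in $T$, which is what makes CTC training practical.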
The CTC loss function is then simply the negative log-likelihood of the correct label sequence:
$$\mathcal{L}_{\text{CTC}}(x, y) = -\ln P(y \mid x)$$
Training involves minimizing this loss function using gradient descent and backpropagation through the acoustic model (e.g., an RNN) and the CTC calculation layer. The gradients $\frac{\partial \mathcal{L}_{\text{CTC}}}{\partial p(k \mid t, x)}$ can also be computed efficiently using the forward ($\alpha$) and backward ($\beta$) variables.
Once the model is trained, we need a way to infer the most likely output sequence y∗ given a new input x.
$$y^* = \underset{y}{\arg\max}\; P(y \mid x)$$
Finding the exact maximum is difficult. Common decoding strategies include:
Best Path Decoding (Greedy Search): This is the simplest approach. At each time step $t$, select the label $\pi_t^*$ with the highest probability $p(\pi_t \mid t, x)$. Concatenate these labels to form the path $\pi^* = (\pi_1^*, \dots, \pi_T^*)$, then apply the mapping $B$ to collapse this path into the output sequence $y^* = B(\pi^*)$. While fast, this method is suboptimal because it doesn't consider the combined probability of all paths mapping to a sequence and can make locally optimal choices that lead to a globally suboptimal sequence.
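Best-path decoding is a few lines of code. A sketch, again over a hypothetical hand-made distribution:

```python
def greedy_decode(probs, blank='b'):
    """Best-path decoding: frame-wise arg-max, then collapse with B."""
    path = [max(p, key=p.get) for p in probs]   # most likely label per frame
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:            # merge repeats, drop blanks
            out.append(s)
        prev = s
    return tuple(out)

# Hypothetical network outputs for T = 3 frames over V' = {a, c, t, b}.
probs = [
    {'a': 0.1, 'c': 0.6, 't': 0.1, 'b': 0.2},
    {'a': 0.7, 'c': 0.1, 't': 0.1, 'b': 0.1},
    {'a': 0.1, 'c': 0.1, 't': 0.6, 'b': 0.2},
]
print(greedy_decode(probs))  # ('c', 'a', 't')
```

Note that this picks the single most probable path, which need not collapse to the most probable output sequence, since many lower-probability paths can share one collapsed sequence.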
Beam Search Decoding: This heuristic search algorithm explores a more promising part of the search space. It maintains a set (beam) of the k most probable candidate sequences (hypotheses) at each time step. For each hypothesis in the beam, it explores possible next labels (including blank), calculates the probabilities of the extended hypotheses, and keeps only the top k overall. Probabilities for hypotheses ending in the same collapsed sequence are merged. This significantly improves performance over greedy search but increases computational cost.
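A compact sketch of CTC prefix beam search, following the standard formulation: each collapsed prefix keeps two scores, the total probability of paths ending in a blank and in a non-blank, so that all paths collapsing to the same prefix are merged. This is illustrative rather than production code (real decoders work in log space and usually fold in a language model):

```python
from collections import defaultdict

def prefix_beam_search(probs, blank='b', beam_width=4):
    """Simplified CTC prefix beam search over per-frame distributions.

    `beam` maps each collapsed prefix to (p_b, p_nb): the total probability
    of all paths producing that prefix and ending in a blank / non-blank.
    """
    beam = {(): (1.0, 0.0)}               # empty prefix "ends in blank"
    for p_t in probs:
        next_beam = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            for symbol, p in p_t.items():
                if symbol == blank:
                    # A blank never changes the collapsed prefix.
                    b, nb = next_beam[prefix]
                    next_beam[prefix] = (b + (p_b + p_nb) * p, nb)
                elif prefix and symbol == prefix[-1]:
                    # Repeat of the last label: it merges into the prefix
                    # unless a blank separated the two occurrences.
                    b, nb = next_beam[prefix]
                    next_beam[prefix] = (b, nb + p_nb * p)
                    ext = prefix + (symbol,)
                    b, nb = next_beam[ext]
                    next_beam[ext] = (b, nb + p_b * p)
                else:
                    ext = prefix + (symbol,)
                    b, nb = next_beam[ext]
                    next_beam[ext] = (b, nb + (p_b + p_nb) * p)
        # Prune to the beam_width most probable prefixes.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1],
                           reverse=True)[:beam_width])
    prefix, (p_b, p_nb) = max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return prefix, p_b + p_nb

# Hypothetical network outputs for T = 3 frames over V' = {a, c, t, b}.
probs = [
    {'a': 0.1, 'c': 0.6, 't': 0.1, 'b': 0.2},
    {'a': 0.7, 'c': 0.1, 't': 0.1, 'b': 0.1},
    {'a': 0.1, 'c': 0.1, 't': 0.6, 'b': 0.2},
]
print(prefix_beam_search(probs, beam_width=8))  # (('c', 'a', 't'), 0.252)
```

Tracking the blank/non-blank split per prefix is what lets probabilities of different paths with the same collapsed sequence be summed, exactly as the merging step in beam search requires.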
Simplified illustration of beam search decoding with CTC. Nodes represent partial hypotheses (collapsed sequences) at different time steps. Edges show transitions based on network outputs P(label∣t), and the search retains only the most probable hypotheses (beam). Probabilities associated with paths leading to the same collapsed sequence are summed.
Advantages: CTC requires no pre-computed frame-level alignments, enables end-to-end training directly from pairs of audio and transcripts, and supports simple, fast greedy decoding.
Disadvantages: The conditional independence assumption prevents the model from capturing dependencies between output labels, so an external language model is often needed for good accuracy, and the output sequence can never be longer than the number of input frames ($U \le T$).
Connectionist Temporal Classification provides a powerful and elegant framework for building end-to-end acoustic models without requiring frame-level alignments. By introducing the blank symbol and a specialized loss function computed via dynamic programming, CTC allows models like RNNs and Transformers to be trained directly on pairings of audio sequences and their transcriptions. While it has limitations, particularly the conditional independence assumption, CTC was a significant step towards modern end-to-end ASR and remains a relevant technique, often forming a component within more complex architectures or serving as a strong baseline. Understanding CTC provides essential context for appreciating the motivations and designs of subsequent end-to-end approaches like attention-based models and RNN Transducers.
© 2025 ApX Machine Learning