Components are assembled into a trainable acoustic model using Connectionist Temporal Classification (CTC). The primary goal is to design a neural network that transforms a sequence of audio features into a sequence of probability distributions over a character vocabulary. This output is precisely what the CTC loss function needs to calculate a loss and train the network.
A standard CTC-based model is a stack of neural network layers, each with a specific responsibility. The architecture is designed to handle the sequential and variable-length nature of speech.
The model typically consists of three main parts: recurrent layers for processing sequences, a linear layer for projecting features into the vocabulary space, and a softmax layer to create probability distributions.
Data flows through the model in a straightforward pipeline: feature sequences are processed by recurrent layers, projected to the vocabulary dimension, and converted into probabilities. The CTC loss function then compares this output with the target text to train the model.
Let's examine each component of this architecture.
The heart of the acoustic model is a stack of recurrent layers, most commonly Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks. These are chosen because speech is inherently sequential; the meaning of a sound at a given moment often depends on the sounds that came before and after it.
The input to this recurrent core is the batch of feature sequences, typically with a shape of (batch_size, time_steps, num_features). The output is a new sequence of hidden states, one for each time step. If a bidirectional LSTM with a hidden size of H is used, the output shape for each sequence will be (time_steps, 2 * H).
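To make these shapes concrete, here is a minimal PyTorch sketch of just the recurrent core. The batch size, sequence length, feature dimension, and hidden size are arbitrary values chosen purely for illustration.

import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not fixed by the architecture)
batch_size, time_steps, num_features, H = 4, 200, 80, 256

features = torch.randn(batch_size, time_steps, num_features)
lstm = nn.LSTM(input_size=num_features, hidden_size=H,
               bidirectional=True, batch_first=True)

hidden_states, _ = lstm(features)
print(hidden_states.shape)  # torch.Size([4, 200, 512]) -> (batch, time_steps, 2 * H)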
After the recurrent layers have processed the audio features and captured temporal patterns, we need to map their output to our desired vocabulary. This is done with a standard fully connected (or "Dense") layer.
This layer is applied independently to every single time step in the output sequence from the LSTMs. This is often referred to as a "time-distributed" dense layer. Its job is to take the hidden state at each time step and project it into a vector whose length is equal to the size of our vocabulary plus one.
The extra slot corresponds to the special <blank> token required by CTC. If our character set has 28 symbols, the output dimension of this layer will be 29. The output of this layer for each time step is a vector of raw, unnormalized scores called logits. For a batch, the input to this layer might be (batch_size, time_steps, 2 * H), and its output will be (batch_size, time_steps, vocab_size + 1).
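Here is a minimal sketch of this projection, assuming the 28-symbol character set mentioned above and the bidirectional LSTM output from the previous step.

import torch
import torch.nn as nn

# Illustrative sizes: 28 characters plus one CTC blank (assumptions for this sketch)
batch_size, time_steps, H, vocab_size = 4, 200, 256, 28

hidden_states = torch.randn(batch_size, time_steps, 2 * H)  # bidirectional LSTM output
projection = nn.Linear(2 * H, vocab_size + 1)

# nn.Linear acts on the last dimension, so it is applied to every time step independently
logits = projection(hidden_states)
print(logits.shape)  # torch.Size([4, 200, 29]) -> (batch, time_steps, vocab_size + 1)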
The final step within the model is to convert the raw logit scores from the dense layer into probabilities. A softmax activation function is applied to the last dimension (the vocabulary dimension) of the output tensor.
For each time step, the softmax function takes the vector of logits and normalizes it into a probability distribution, where all values are between 0 and 1 and sum to 1. The resulting tensor, often called the emission matrix or probability matrix, has a shape of (batch_size, time_steps, vocab_size + 1). Each entry at (t, c) represents the probability of observing character c at time step t.
This probability matrix is the final output of the acoustic model during training and is the input that the CTC loss function requires.
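Continuing the same assumed shapes, the softmax is applied along the last (vocabulary) dimension; this short sketch verifies that each time step now holds a valid probability distribution.

import torch

# Assumed logits shape (batch, time_steps, vocab_size + 1) from the projection layer
logits = torch.randn(4, 200, 29)
probabilities = torch.softmax(logits, dim=-1)

print(probabilities.shape)        # torch.Size([4, 200, 29])
print(probabilities[0, 0].sum())  # tensor(1.) -- each per-step distribution sums to 1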
In a framework like PyTorch or TensorFlow, you can define this architecture as a sequence of layers. The following example shows a simplified structure in PyTorch.
import torch
import torch.nn as nn
class CTCAcousticModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, n_layers=2):
        super(CTCAcousticModel, self).__init__()
        # Recurrent layers to process the feature sequence
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=True,
            batch_first=True  # Expects input as (batch, seq, feature)
        )
        # Fully connected layer to map LSTM outputs to vocabulary size
        # The input dimension is hidden_dim * 2 because the LSTM is bidirectional
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        # x shape: (batch, time_steps, num_features)
        # Pass input through LSTM layers
        lstm_out, _ = self.lstm(x)
        # lstm_out shape: (batch, time_steps, hidden_dim * 2)
        # Pass each time step's output through the fully connected layer
        logits = self.fc(lstm_out)
        # logits shape: (batch, time_steps, output_dim)
        # Most CTC loss implementations expect log probabilities for numerical stability
        return nn.functional.log_softmax(logits, dim=2)
This class encapsulates the entire process. It takes a batch of feature sequences and returns a batch of log-probability matrices, ready to be passed to a CTC loss function along with the ground truth transcriptions. In the next section, we will put this all together in a hands-on session to train our first acoustic model.
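As a quick preview of that training step, here is a minimal sketch of how the model's output could be passed to PyTorch's nn.CTCLoss. The tensor sizes and dummy targets are placeholder assumptions, and index 0 is assumed to be reserved for the blank token.

import torch
import torch.nn as nn

# Assumed sizes: 80 features per frame, 28-character vocabulary plus one blank
input_dim, hidden_dim, vocab_size = 80, 256, 28
model = CTCAcousticModel(input_dim, hidden_dim, output_dim=vocab_size + 1)
ctc_loss = nn.CTCLoss(blank=0)  # assumes index 0 is the blank token

features = torch.randn(4, 200, input_dim)            # (batch, time_steps, num_features)
targets = torch.randint(1, vocab_size + 1, (4, 30))  # dummy label indices 1..28
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

log_probs = model(features)  # (batch, time_steps, vocab_size + 1)
# nn.CTCLoss expects (time_steps, batch, classes), so swap the first two dimensions
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()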