This practical exercise focuses on building and training an acoustic model that utilizes LSTMs and Connectionist Temporal Classification (CTC) loss. It will guide you through the process of building a simple but functional end-to-end speech recognition system using PyTorch. We will use a small, manageable dataset to focus on the implementation details without being burdened by long training times.
Our goal is to create a model that takes Mel spectrograms as input and outputs a sequence of character probabilities, which are then used by the CTC loss function to compute the gradient and train the network.
First, ensure you have the necessary libraries installed. We will primarily use torch and torchaudio for modeling and data processing, and librosa for feature extraction.
pip install torch torchaudio librosa
For this exercise, we'll assume you have a pre-processed dataset consisting of audio files and a corresponding metadata file (e.g., a .csv or .json) that maps each audio file to its transcription. Let's define a PyTorch Dataset to handle loading, processing, and tokenizing our data.
A critical first step is to define our alphabet. The model's output layer must have one node for each possible character, plus an additional node for the special blank token required by CTC.
# A simple character set for English
# Index 0 is reserved for the CTC blank token, so characters start at index 1
char_map_str = """
<SPACE> 1
a 2
b 3
c 4
d 5
e 6
f 7
g 8
h 9
i 10
j 11
k 12
l 13
m 14
n 15
o 16
p 17
q 18
r 19
s 20
t 21
u 22
v 23
w 24
x 25
y 26
z 27
"""
# In a real project, you would generate this from your training data's transcripts.
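With the character map in place, here is a minimal sketch of the Dataset described earlier. The class names TextTransform and SpeechDataset, the 16 kHz sample rate, and the assumption that your metadata has been parsed into a list of (audio_path, transcript) pairs are illustrative choices; the sketch uses torchaudio's MelSpectrogram transform for feature extraction, though librosa would work equally well.

import torch
import torchaudio
from torch.utils.data import Dataset


class TextTransform:
    """Maps characters to integer labels (and back) using char_map_str."""
    def __init__(self, char_map_str):
        self.char_map = {}
        for line in char_map_str.strip().split("\n"):
            ch, index = line.split()
            self.char_map[ch] = int(index)
        self.index_map = {v: k for k, v in self.char_map.items()}

    def text_to_int(self, text):
        labels = []
        for c in text.lower():
            c = "<SPACE>" if c == " " else c
            if c in self.char_map:  # characters outside the map are skipped
                labels.append(self.char_map[c])
        return labels


class SpeechDataset(Dataset):
    """Loads audio, computes Mel spectrograms, and tokenizes transcripts.

    `samples` is a list of (audio_path, transcript) pairs parsed from your
    metadata file. Resampling is omitted for brevity, so the audio is
    assumed to already be at 16 kHz.
    """
    def __init__(self, samples, text_transform, n_mels=128):
        self.samples = samples
        self.text_transform = text_transform
        self.mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        audio_path, transcript = self.samples[idx]
        waveform, _ = torchaudio.load(audio_path)
        # (channel, n_mels, time) -> (time, n_mels), which becomes
        # (batch, time, features) once batched
        spectrogram = self.mel_transform(waveform).squeeze(0).transpose(0, 1)
        label = torch.tensor(
            self.text_transform.text_to_int(transcript), dtype=torch.long
        )
        return spectrogram, label, spectrogram.shape[0], len(label)

Each item is the (spectrogram, label, input_length, label_length) tuple that the collate function below expects.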
Next, we create a custom collate_fn for our DataLoader. Because audio clips and their transcriptions vary in length, we cannot simply stack them into a batch. This function pads each sequence in a batch to the length of the longest sequence, and it also keeps track of the original, unpadded lengths. The CTC loss function requires these original lengths to work correctly.
# In your data loading script
import torch
import torchaudio
def collate_fn(batch):
# A batch consists of a list of tuples: (spectrogram, label, input_len, label_len)
spectrograms = [item[0] for item in batch]
labels = [item[1] for item in batch]
input_lengths = [item[2] for item in batch]
label_lengths = [item[3] for item in batch]
# Pad the spectrograms and labels
padded_spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
padded_labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True)
return padded_spectrograms, padded_labels, torch.tensor(input_lengths), torch.tensor(label_lengths)
This collate_fn is a standard pattern when working with variable-length sequence data in PyTorch and is essential for batching our speech data.
Our acoustic model will have a straightforward architecture. It will consist of a few bidirectional LSTM layers followed by a fully connected linear layer. The LSTMs are responsible for learning temporal patterns in the speech features, while the final linear layer projects the LSTM's output into a probability distribution over our character vocabulary for each time step.
Let's define this model in PyTorch.
import torch.nn as nn
class LSTMAcousticModel(nn.Module):
def __init__(self, n_features, n_hidden, n_class, n_layers, dropout):
super(LSTMAcousticModel, self).__init__()
# The LSTM layers
self.lstm = nn.LSTM(
input_size=n_features,
hidden_size=n_hidden,
num_layers=n_layers,
batch_first=True,
bidirectional=True,
dropout=dropout
)
# The classification layer
# Output is 2 * n_hidden because the LSTM is bidirectional
self.classifier = nn.Linear(n_hidden * 2, n_class)
def forward(self, x):
# x is the input spectrogram: (batch, time, features)
lstm_out, _ = self.lstm(x)
# Pass the LSTM output through the classifier
# The output is (batch, time, n_class)
output = self.classifier(lstm_out)
# We need a log_softmax for the CTC Loss
# The CTC loss expects the time dimension to be first
return nn.functional.log_softmax(output, dim=2).permute(1, 0, 2)
The diagram below illustrates the flow of data through our model. Input spectrograms are processed by the bidirectional LSTM, and the resulting hidden states are passed to a linear layer that produces the character probabilities required for CTC.
Data flow within the simple LSTM-based acoustic model. The final permute operation in the code reorders the tensor dimensions to match what PyTorch's CTCLoss expects.
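Before wiring the model into a training loop, a quick sanity check with a random tensor confirms that the output shape matches what CTCLoss expects. This is only an illustrative check using the class defined above; the batch size and sequence length are arbitrary.

# Quick shape check with random data (not real audio)
model = LSTMAcousticModel(n_features=128, n_hidden=256, n_class=28, n_layers=2, dropout=0.2)
dummy_batch = torch.randn(4, 200, 128)  # (batch, time, features)
with torch.no_grad():
    out = model(dummy_batch)
print(out.shape)  # torch.Size([200, 4, 28]) -> (time, batch, n_class)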
With the model and data loader ready, we can write the main training function. This function will orchestrate the process of feeding data to the model, calculating the loss, and updating the model's weights.
The core of this process is torch.nn.CTCLoss. It requires four arguments:
log_probs: The log-probability outputs from our model, with shape (Time, Batch, Classes).
targets: The character labels for the batch, either concatenated into a single 1D tensor or, as our collate_fn produces, padded into a 2D (Batch, MaxLabelLength) tensor.
input_lengths: A tensor containing the original length of each spectrogram in the batch.
target_lengths: A tensor containing the original length of each transcription in the batch.
Here is a simplified function for a single training epoch.
def train_epoch(model, device, train_loader, criterion, optimizer):
model.train()
running_loss = 0.0
for i, batch in enumerate(train_loader):
# Move data to the selected device (e.g., GPU)
spectrograms, labels, input_lengths, label_lengths = batch
spectrograms, labels = spectrograms.to(device), labels.to(device)
optimizer.zero_grad()
# Forward pass
# The output shape is (Time, Batch, n_class)
output = model(spectrograms)
# Calculate CTC loss
loss = criterion(output, labels, input_lengths, label_lengths)
# Backward pass and optimization
loss.backward()
optimizer.step()
running_loss += loss.item()
avg_loss = running_loss / len(train_loader)
print(f"Training Loss: {avg_loss:.4f}")
return avg_loss
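During training you would also evaluate on held-out data each epoch. A validation pass mirrors train_epoch but skips the backward pass and weight update. Below is a minimal sketch of such an evaluate function; it assumes valid_loader yields the same batch structure as train_loader.

def evaluate(model, device, valid_loader, criterion):
    model.eval()
    running_loss = 0.0
    with torch.no_grad():
        for batch in valid_loader:
            spectrograms, labels, input_lengths, label_lengths = batch
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            # Forward pass only; no gradients or weight updates
            output = model(spectrograms)
            loss = criterion(output, labels, input_lengths, label_lengths)
            running_loss += loss.item()
    avg_loss = running_loss / len(valid_loader)
    print(f"Validation Loss: {avg_loss:.4f}")
    return avg_loss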
To run the training, you would instantiate the model, loss function, and optimizer, then call this train_epoch function in a loop.
# Hyperparameters
n_features = 128 # Number of features in the Mel spectrogram
n_hidden = 256
n_class = 28 # 27 characters in our vocabulary + 1 for the CTC blank token
n_layers = 2
dropout = 0.2
learning_rate = 1e-4
epochs = 10
# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMAcousticModel(n_features, n_hidden, n_class, n_layers, dropout).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
criterion = nn.CTCLoss(blank=0).to(device) # index 0 is reserved for the blank token in our character map
# Dummy train_loader and valid_loader for demonstration
# In practice, these would be torch.utils.data.DataLoader instances
# train_loader = ...
# valid_loader = ...
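# A minimal sketch of how these loaders could be built, assuming the
# SpeechDataset, TextTransform, and collate_fn sketches from earlier and
# metadata parsed into lists of (audio_path, transcript) pairs; the batch
# size is an arbitrary choice:
#
# text_transform = TextTransform(char_map_str)
# train_loader = torch.utils.data.DataLoader(
#     SpeechDataset(train_samples, text_transform),
#     batch_size=16, shuffle=True, collate_fn=collate_fn)
# valid_loader = torch.utils.data.DataLoader(
#     SpeechDataset(valid_samples, text_transform),
#     batch_size=16, shuffle=False, collate_fn=collate_fn)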
# Training loop
for epoch in range(epochs):
print(f"Epoch {epoch+1}/{epochs}")
train_loss = train_epoch(model, device, train_loader, criterion, optimizer)
# You would also typically run a validation loop here
# valid_loss = evaluate(...)
As training progresses, you should see the CTC loss decrease, indicating that the model is learning to map the audio features to the correct character sequences. Plotting the training and validation loss over epochs is a standard way to monitor this process and check for overfitting. A healthy training process shows both loss values decreasing steadily.
Example training curve for an LSTM-CTC model. The gap between training and validation loss suggests the model might be starting to overfit, a common issue that can be addressed with more data, regularization, or data augmentation.
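If you collect the loss values returned by train_epoch and by a validation pass into two lists (one entry per epoch), a few lines of matplotlib are enough to produce such a curve. This is a minimal sketch; the list names train_losses and valid_losses are assumptions.

import matplotlib.pyplot as plt

def plot_losses(train_losses, valid_losses):
    # One loss value per epoch, collected during the training loop
    epochs_range = range(1, len(train_losses) + 1)
    plt.plot(epochs_range, train_losses, label="Training loss")
    plt.plot(epochs_range, valid_losses, label="Validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("CTC loss")
    plt.legend()
    plt.show()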
After training, you can use a simple greedy decoder to transcribe a new audio file. This involves passing the spectrogram through the model, taking the argmax of the output probabilities at each time step to get the most likely character, and then collapsing repeated characters and removing blank tokens to produce the final text.
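That description maps directly to a short function. Below is a minimal sketch of such a greedy decoder; it assumes the blank token sits at index 0 and that you have an index_map dictionary mapping integer labels back to characters (the inverse of the character map, as built in the TextTransform sketch earlier).

def greedy_decode(log_probs, index_map, blank=0):
    # log_probs: model output of shape (time, batch, n_class); decode batch item 0
    best_path = torch.argmax(log_probs[:, 0, :], dim=-1)  # most likely label per frame
    decoded = []
    previous = blank
    for idx in best_path.tolist():
        # Collapse consecutive repeats and drop blank tokens
        if idx != blank and idx != previous:
            decoded.append(index_map[idx])
        previous = idx
    return "".join(decoded).replace("<SPACE>", " ")

To transcribe a file, you would compute its spectrogram exactly as during training, add a batch dimension, run it through the model inside torch.no_grad(), and pass the result to this function.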
This hands-on session provided a complete, if simple, pipeline for training an acoustic model. You have successfully built a system that learns to transcribe speech from audio features. While this model is a great start, its performance can be significantly improved with more advanced architectures and decoding techniques, which we will cover in the subsequent chapters.