This practical exercise focuses on building and training an acoustic model that utilizes LSTMs and Connectionist Temporal Classification (CTC) loss. It will guide you through the process of building a simple but functional end-to-end speech recognition system using PyTorch. We will use a small, manageable dataset to focus on the implementation details without being burdened by long training times.
Our goal is to create a model that takes Mel spectrograms as input and outputs a sequence of character probabilities, which are then used by the CTC loss function to compute the gradient and train the network.
First, ensure you have the necessary libraries installed. We will primarily use torch and torchaudio for modeling and data processing, and librosa for feature extraction.
pip install torch torchaudio librosa
For this exercise, we'll assume you have a pre-processed dataset consisting of audio files and a corresponding metadata file (e.g., a .csv or .json) that maps each audio file to its transcription. Let's define a PyTorch Dataset to handle loading, processing, and tokenizing our data.
A critical first step is to define our alphabet. The model's output layer must have one node for each possible character, plus an additional node for the special blank token required by CTC.
# A simple character set for English
# Index 0 is reserved for the CTC blank token, so characters start at index 1
char_map_str = """
<SPACE> 1
a 2
b 3
c 4
d 5
e 6
f 7
g 8
h 9
i 10
j 11
k 12
l 13
m 14
n 15
o 16
p 17
q 18
r 19
s 20
t 21
u 22
v 23
w 24
x 25
y 26
z 27
"""
# In a real project, you would generate this from your training data's transcripts.
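With the character map in place, here is a minimal sketch of the Dataset described earlier. The class names TextTransform and SpeechDataset, the 16 kHz sample rate, and the assumption that your metadata has been parsed into a list of (audio_path, transcript) pairs are illustrative choices; the sketch uses torchaudio's MelSpectrogram transform for feature extraction, though librosa would work equally well.

import torch
import torchaudio
from torch.utils.data import Dataset


class TextTransform:
    """Maps characters to integer labels (and back) using char_map_str."""
    def __init__(self, char_map_str):
        self.char_map = {}
        for line in char_map_str.strip().split("\n"):
            ch, index = line.split()
            self.char_map[ch] = int(index)
        self.index_map = {v: k for k, v in self.char_map.items()}

    def text_to_int(self, text):
        labels = []
        for c in text.lower():
            c = "<SPACE>" if c == " " else c
            if c in self.char_map:  # characters outside the map are skipped
                labels.append(self.char_map[c])
        return labels


class SpeechDataset(Dataset):
    """Loads audio, computes Mel spectrograms, and tokenizes transcripts.

    `samples` is a list of (audio_path, transcript) pairs parsed from your
    metadata file. Resampling is omitted for brevity, so the audio is
    assumed to already be at 16 kHz.
    """
    def __init__(self, samples, text_transform, n_mels=128):
        self.samples = samples
        self.text_transform = text_transform
        self.mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels
        )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        audio_path, transcript = self.samples[idx]
        waveform, _ = torchaudio.load(audio_path)
        # (channel, n_mels, time) -> (time, n_mels), which becomes
        # (batch, time, features) once batched
        spectrogram = self.mel_transform(waveform).squeeze(0).transpose(0, 1)
        label = torch.tensor(
            self.text_transform.text_to_int(transcript), dtype=torch.long
        )
        return spectrogram, label, spectrogram.shape[0], len(label)

Each item is the (spectrogram, label, input_length, label_length) tuple that the collate function below expects.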
Next, we create a custom collate_fn for our DataLoader. Because audio clips and their transcriptions vary in length, we cannot simply stack them into a batch. This function pads each sequence in a batch to the length of the longest sequence, and it also keeps track of the original, unpadded lengths. The CTC loss function requires these original lengths to work correctly.
# In your data loading script
import torch
import torchaudio
def collate_fn(batch):
# A batch consists of a list of tuples: (spectrogram, label, input_len, label_len)
spectrograms = [item[0] for item in batch]
labels = [item[1] for item in batch]
input_lengths = [item[2] for item in batch]
label_lengths = [item[3] for item in batch]
# Pad the spectrograms and labels
padded_spectrograms = torch.nn.utils.rnn.pad_sequence(spectrograms, batch_first=True)
padded_labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True)
return padded_spectrograms, padded_labels, torch.tensor(input_lengths), torch.tensor(label_lengths)
This collate_fn is a standard pattern when working with variable-length sequence data in PyTorch and is essential for batching our speech data.
Our acoustic model will have a straightforward architecture. It will consist of a few bidirectional LSTM layers followed by a fully connected linear layer. The LSTMs are responsible for learning temporal patterns in the speech features, while the final linear layer projects the LSTM's output into a probability distribution over our character vocabulary for each time step.
Let's define this model in PyTorch.
import torch.nn as nn
class LSTMAcousticModel(nn.Module):
def __init__(self, n_features, n_hidden, n_class, n_layers, dropout):
super(LSTMAcousticModel, self).__init__()
# The LSTM layers
self.lstm = nn.LSTM(
input_size=n_features,
hidden_size=n_hidden,
num_layers=n_layers,
batch_first=True,
bidirectional=True,
dropout=dropout
)
# The classification layer
# Output is 2 * n_hidden because the LSTM is bidirectional
self.classifier = nn.Linear(n_hidden * 2, n_class)
def forward(self, x):
# x is the input spectrogram: (batch, time, features)
lstm_out, _ = self.lstm(x)
# Pass the LSTM output through the classifier
# The output is (batch, time, n_class)
output = self.classifier(lstm_out)
# We need a log_softmax for the CTC Loss
# The CTC loss expects the time dimension to be first
return nn.functional.log_softmax(output, dim=2).permute(1, 0, 2)
The diagram below illustrates the flow of data through our model. Input spectrograms are processed by the bidirectional LSTM, and the resulting hidden states are passed to a linear layer that produces the character probabilities required for CTC.
Data flow within the simple LSTM-based acoustic model. The final permute operation in the code reorders the tensor dimensions to match what PyTorch's CTCLoss expects.
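Before wiring the model into a training loop, a quick sanity check with a random tensor confirms that the output shape matches what CTCLoss expects. This is only an illustrative check using the class defined above; the batch size and sequence length are arbitrary.

# Quick shape check with random data (not real audio)
model = LSTMAcousticModel(n_features=128, n_hidden=256, n_class=28, n_layers=2, dropout=0.2)
dummy_batch = torch.randn(4, 200, 128)  # (batch, time, features)
with torch.no_grad():
    out = model(dummy_batch)
print(out.shape)  # torch.Size([200, 4, 28]) -> (time, batch, n_class)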
With the model and data loader ready, we can write the main training function. This function will orchestrate the process of feeding data to the model, calculating the loss, and updating the model's weights.
The core of this process is torch.nn.CTCLoss. It requires four arguments:
log_probs: The log-probability outputs from our model, with shape (Time, Batch, Classes).
targets: The character labels for the batch, either concatenated into a single 1D tensor or, as our collate_fn produces, padded into a 2D (Batch, MaxLabelLength) tensor.
input_lengths: A tensor containing the original length of each spectrogram in the batch.
target_lengths: A tensor containing the original length of each transcription in the batch.
Here is a simplified function for a single training epoch.
def train_epoch(model, device, train_loader, criterion, optimizer):
model.train()
running_loss = 0.0
for i, batch in enumerate(train_loader):
# Move data to the selected device (e.g., GPU)
spectrograms, labels, input_lengths, label_lengths = batch
spectrograms, labels = spectrograms.to(device), labels.to(device)
optimizer.zero_grad()
# Forward pass
# The output shape is (Time, Batch, n_class)
output = model(spectrograms)
# Calculate CTC loss
loss = criterion(output, labels, input_lengths, label_lengths)
# Backward pass and optimization
loss.backward()
optimizer.step()
running_loss += loss.item()
avg_loss = running_loss / len(train_loader)
print(f"Training Loss: {avg_loss:.4f}")
return avg_loss
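During training you would also evaluate on held-out data each epoch. A validation pass mirrors train_epoch but skips the backward pass and weight update. Below is a minimal sketch of such an evaluate function; it assumes valid_loader yields the same batch structure as train_loader.

def evaluate(model, device, valid_loader, criterion):
    model.eval()
    running_loss = 0.0
    with torch.no_grad():
        for batch in valid_loader:
            spectrograms, labels, input_lengths, label_lengths = batch
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            # Forward pass only; no gradients or weight updates
            output = model(spectrograms)
            loss = criterion(output, labels, input_lengths, label_lengths)
            running_loss += loss.item()
    avg_loss = running_loss / len(valid_loader)
    print(f"Validation Loss: {avg_loss:.4f}")
    return avg_loss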
To run the training, you would instantiate the model, loss function, and optimizer, then call this train_epoch function in a loop.
# Hyperparameters
n_features = 128 # Number of features in the Mel spectrogram
n_hidden = 256
n_class = 28 # 27 characters in our vocabulary + 1 for the CTC blank token
n_layers = 2
dropout = 0.2
learning_rate = 1e-4
epochs = 10
# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMAcousticModel(n_features, n_hidden, n_class, n_layers, dropout).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
criterion = nn.CTCLoss(blank=0).to(device) # index 0 is reserved for the blank token in our character map
# Dummy train_loader and valid_loader for demonstration
# In practice, these would be torch.utils.data.DataLoader instances
# train_loader = ...
# valid_loader = ...
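# A minimal sketch of how these loaders could be built, assuming the
# SpeechDataset, TextTransform, and collate_fn sketches from earlier and
# metadata parsed into lists of (audio_path, transcript) pairs; the batch
# size is an arbitrary choice:
#
# text_transform = TextTransform(char_map_str)
# train_loader = torch.utils.data.DataLoader(
#     SpeechDataset(train_samples, text_transform),
#     batch_size=16, shuffle=True, collate_fn=collate_fn)
# valid_loader = torch.utils.data.DataLoader(
#     SpeechDataset(valid_samples, text_transform),
#     batch_size=16, shuffle=False, collate_fn=collate_fn)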
# Training loop
for epoch in range(epochs):
print(f"Epoch {epoch+1}/{epochs}")
train_loss = train_epoch(model, device, train_loader, criterion, optimizer)
# You would also typically run a validation loop here
# valid_loss = evaluate(...)
As training progresses, you should see the CTC loss decrease, indicating that the model is learning to map the audio features to the correct character sequences. Plotting the training and validation loss over epochs is a standard way to monitor this process and check for overfitting. A healthy training process shows both loss values decreasing steadily.
Example training curve for an LSTM-CTC model. The gap between training and validation loss suggests the model might be starting to overfit, a common issue that can be addressed with more data, regularization, or data augmentation.
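If you collect the loss values returned by train_epoch and by a validation pass into two lists (one entry per epoch), a few lines of matplotlib are enough to produce such a curve. This is a minimal sketch; the list names train_losses and valid_losses are assumptions.

import matplotlib.pyplot as plt

def plot_losses(train_losses, valid_losses):
    # One loss value per epoch, collected during the training loop
    epochs_range = range(1, len(train_losses) + 1)
    plt.plot(epochs_range, train_losses, label="Training loss")
    plt.plot(epochs_range, valid_losses, label="Validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("CTC loss")
    plt.legend()
    plt.show()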
After training, you can use a simple greedy decoder to transcribe a new audio file. This involves passing the spectrogram through the model, taking the argmax of the output probabilities at each time step to get the most likely character, and then collapsing repeated characters and removing blank tokens to produce the final text.
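That description maps directly to a short function. Below is a minimal sketch of such a greedy decoder; it assumes the blank token sits at index 0 and that you have an index_map dictionary mapping integer labels back to characters (the inverse of the character map, as built in the TextTransform sketch earlier).

def greedy_decode(log_probs, index_map, blank=0):
    # log_probs: model output of shape (time, batch, n_class); decode batch item 0
    best_path = torch.argmax(log_probs[:, 0, :], dim=-1)  # most likely label per frame
    decoded = []
    previous = blank
    for idx in best_path.tolist():
        # Collapse consecutive repeats and drop blank tokens
        if idx != blank and idx != previous:
            decoded.append(index_map[idx])
        previous = idx
    return "".join(decoded).replace("<SPACE>", " ")

To transcribe a file, you would compute its spectrogram exactly as during training, add a batch dimension, run it through the model inside torch.no_grad(), and pass the result to this function.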
This hands-on session provided a complete, if simple, pipeline for training an acoustic model. You have successfully built a system that learns to transcribe speech from audio features. While this model is a great start, its performance can be significantly improved with more advanced architectures and decoding techniques, which we will cover in the subsequent chapters.