After transforming raw audio into a sequence of feature vectors, the immediate challenge is to model the temporal relationships within that sequence. A standard feed-forward neural network processes each input vector independently, ignoring its position in time. This is insufficient for speech, where the order of sounds defines words and meaning. For example, the phonemes in "cat" are the same as in "act," but their order is what distinguishes them. To address this, we turn to a class of models designed specifically for sequential data: Recurrent Neural Networks (RNNs).
An RNN is uniquely suited for time-series data because it has a form of memory. It processes a sequence one element at a time, and at each step, its calculations include information from the previous step. This is achieved through a hidden state, which acts as a summary of the sequence seen so far.
The core operation of a simple RNN at each time step $t$ can be described by the following equation:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here:

- $x_t$ is the feature vector fed to the network at time step $t$.
- $h_{t-1}$ is the hidden state carried over from the previous step, and $h_t$ is the updated hidden state.
- $W_{xh}$ and $W_{hh}$ are weight matrices applied to the input and to the previous hidden state, respectively.
- $b_h$ is a bias vector, and the $\tanh$ activation keeps the hidden state values in the range $(-1, 1)$.
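To make the update concrete, here is a minimal sketch that computes a single hidden-state update by hand. The sizes (40 input features, a 256-dimensional hidden state) and the randomly initialized weights are illustrative assumptions, not values from a trained model:

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 40, 256            # assumed sizes for illustration

# Randomly initialized recurrence parameters
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

x_t = torch.randn(input_size)                # feature vector for the current frame
h_prev = torch.zeros(hidden_size)            # hidden state from the previous step

# h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
print(h_t.shape)                             # torch.Size([256])
```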
To generate a prediction, the hidden state at each step is passed through another layer, typically a fully connected layer with a softmax activation, to produce an output vector $o_t$:

$$o_t = \text{softmax}(W_{ho} h_t + b_o)$$

This output $o_t$ represents a probability distribution over our target vocabulary (e.g., all characters 'a'-'z', '0'-'9', and special symbols) for that specific time step. The diagram below illustrates this process, showing the network "unrolled" in time.
An RNN processes a sequence of inputs $(x_1, x_2, \dots)$ step by step. At each step, it produces an output $(o_1, o_2, \dots)$ and updates its hidden state $(h_1, h_2, \dots)$, which carries context forward to the next step.
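Written out as code, the unrolled computation is just a loop over time steps. The sketch below extends the single-step update to a short sequence and applies the output projection with a softmax at each step; all sizes and weights are again illustrative assumptions:

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, num_classes = 40, 256, 29   # assumed sizes

W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
W_ho = torch.randn(num_classes, hidden_size) * 0.1
b_o = torch.zeros(num_classes)

T = 5                                   # a short sequence for illustration
x = torch.randn(T, input_size)          # one feature vector per time step
h = torch.zeros(hidden_size)            # initial hidden state

outputs = []
for t in range(T):
    h = torch.tanh(W_hh @ h + W_xh @ x[t] + b_h)   # update the hidden state
    o = torch.softmax(W_ho @ h + b_o, dim=0)       # per-step class distribution
    outputs.append(o)

outputs = torch.stack(outputs)          # shape: (T, num_classes)
print(outputs.shape, outputs[0].sum())  # each row sums to 1
```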
In our ASR pipeline, the sequence of feature vectors extracted from the audio serves as the input sequence $X = (x_1, x_2, \dots, x_T)$ to the RNN.
This process results in an output sequence of probability distributions that is the same length $T$ as the input feature sequence. This presents a challenge: a 10-second audio clip might produce 1000 feature vectors and thus 1000 output predictions, while the corresponding text transcription might only be 50 characters long. This mismatch in length is a fundamental problem that we will solve with the Connectionist Temporal Classification (CTC) loss function, which is covered later in this chapter.
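The arithmetic behind this mismatch is easy to sketch. Assuming common framing defaults (a 10 ms hop between feature frames) and a hypothetical transcript, the counts look like this:

```python
# Assumed framing: one feature vector every 10 ms (a common default hop size)
clip_duration_s = 10.0
hop_s = 0.010

num_frames = int(clip_duration_s / hop_s)
transcript = "the quick brown fox jumps over the lazy dog"   # hypothetical label

print(num_frames)       # 1000 per-frame predictions from the RNN
print(len(transcript))  # 43 characters to align them against
```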
Implementing a basic RNN layer is straightforward in modern deep learning frameworks. In PyTorch, you can define a simple RNN-based acoustic model like this:
```python
import torch
import torch.nn as nn

class SimpleASR_RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleASR_RNN, self).__init__()
        self.hidden_size = hidden_size
        # The RNN layer processes the sequence of input features
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # A fully connected layer to map hidden states to character probabilities
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize the hidden state for the first time step
        # Shape: (num_layers, batch_size, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)

        # Pass the input sequence and initial hidden state to the RNN
        # out: contains the output hidden state for each time step
        # hidden: contains the final hidden state of the sequence
        out, hidden = self.rnn(x, h0)

        # Pass the RNN's output through the fully connected layer
        # to get predictions for each time step
        out = self.fc(out)
        return out

# Example usage:
# num_features = 40 (e.g., 40 MFCCs)
# rnn_hidden_size = 256
# num_output_classes = 29 (e.g., 26 letters + space + apostrophe + blank token)
# model = SimpleASR_RNN(num_features, rnn_hidden_size, num_output_classes)
```
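To confirm that the model produces one prediction per time step, you can run the class defined above on a random batch (assuming the imports and class from that listing; the batch size, sequence length, and feature count below are illustrative):

```python
# Illustrative shapes: 4 utterances, 1000 frames each, 40 features per frame
model = SimpleASR_RNN(input_size=40, hidden_size=256, num_classes=29)
features = torch.randn(4, 1000, 40)

logits = model(features)
print(logits.shape)   # torch.Size([4, 1000, 29]): one score vector per time step
```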
In the `SimpleASR_RNN` definition:

- `input_size` corresponds to the number of features in each input vector $x_t$ (e.g., 40).
- `hidden_size` is the dimension of the hidden state vector $h_t$. This is a hyperparameter you can tune.
- `num_classes` is the size of your output vocabulary.

While RNNs are a good starting point, simple RNNs struggle with long sequences. The influence of an input from an early time step gradually diminishes as the network processes more of the sequence. This is known as the vanishing gradient problem, where the gradients used to update the network's weights become vanishingly small over long distances, making it difficult for the model to learn long-range dependencies. In speech recognition, this means the model might forget the beginning of a long sentence by the time it reaches the end.
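A quick, empirical way to observe this is to backpropagate from only the final time step of an untrained `nn.RNN` and compare how much gradient reaches the earliest and latest input frames. The sketch below uses assumed sizes matching the earlier example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=40, hidden_size=256, batch_first=True)

# A long sequence of 1000 frames, tracked so we can inspect input gradients
x = torch.randn(1, 1000, 40, requires_grad=True)
out, _ = rnn(x)

# Backpropagate a scalar that depends only on the final time step's output
out[:, -1, :].sum().backward()

# Compare gradient magnitudes at the first and last frames
print(x.grad[0, 0].norm())    # typically vanishingly small
print(x.grad[0, -1].norm())   # much larger by comparison
```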
To overcome this significant limitation, more sophisticated recurrent architectures were developed. The next section will introduce Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which are designed to better capture and maintain context over long sequences.