After transforming raw audio into a sequence of feature vectors, the immediate challenge is to model the temporal relationships within that sequence. A standard feed-forward neural network processes each input vector independently, ignoring its position in time. This is insufficient for speech, where the order of sounds defines words and meaning. For example, the phonemes in "cat" are the same as in "act," but their order is what distinguishes them. To address this, we turn to a class of models designed specifically for sequential data: Recurrent Neural Networks (RNNs).
An RNN is uniquely suited for time-series data because it has a form of memory. It processes a sequence one element at a time, and at each step, its calculations include information from the previous step. This is achieved through a hidden state, which acts as a summary of the sequence seen so far.
The core operation of a simple RNN at each time step $t$ can be described by the following equation:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here:

- $x_t$ is the feature vector fed to the network at time step $t$.
- $h_{t-1}$ is the hidden state carried over from the previous step, and $h_t$ is the updated hidden state.
- $W_{xh}$ and $W_{hh}$ are weight matrices applied to the input and to the previous hidden state, respectively.
- $b_h$ is a bias vector, and the $\tanh$ activation keeps the hidden state values in the range $(-1, 1)$.
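To make the update concrete, here is a minimal sketch that computes a single hidden-state update by hand. The sizes (40 input features, a 256-dimensional hidden state) and the randomly initialized weights are illustrative assumptions, not values from a trained model:

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 40, 256            # assumed sizes for illustration

# Randomly initialized recurrence parameters
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

x_t = torch.randn(input_size)                # feature vector for the current frame
h_prev = torch.zeros(hidden_size)            # hidden state from the previous step

# h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
print(h_t.shape)                             # torch.Size([256])
```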
To generate a prediction, the hidden state at each step is passed through another layer, typically a fully connected layer with a softmax activation, to produce an output vector $o_t$:

$$o_t = \text{softmax}(W_{ho} h_t + b_o)$$

This output $o_t$ represents a probability distribution over our target vocabulary (e.g., all characters 'a'-'z', '0'-'9', and special symbols) for that specific time step. The diagram below illustrates this process, showing the network "unrolled" in time.
An RNN processes a sequence of inputs $(x_1, x_2, \dots)$ step by step. At each step, it produces an output $(o_1, o_2, \dots)$ and updates its hidden state $(h_1, h_2, \dots)$, which carries context forward to the next step.
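Written out as code, the unrolled computation is just a loop over time steps. The sketch below extends the single-step update to a short sequence and applies the output projection with a softmax at each step; all sizes and weights are again illustrative assumptions:

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, num_classes = 40, 256, 29   # assumed sizes

W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
W_ho = torch.randn(num_classes, hidden_size) * 0.1
b_o = torch.zeros(num_classes)

T = 5                                   # a short sequence for illustration
x = torch.randn(T, input_size)          # one feature vector per time step
h = torch.zeros(hidden_size)            # initial hidden state

outputs = []
for t in range(T):
    h = torch.tanh(W_hh @ h + W_xh @ x[t] + b_h)   # update the hidden state
    o = torch.softmax(W_ho @ h + b_o, dim=0)       # per-step class distribution
    outputs.append(o)

outputs = torch.stack(outputs)          # shape: (T, num_classes)
print(outputs.shape, outputs[0].sum())  # each row sums to 1
```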
In our ASR pipeline, the sequence of feature vectors extracted from the audio serves as the input sequence $X = (x_1, x_2, \dots, x_T)$ to the RNN.
This process results in an output sequence of probability distributions that is the same length $T$ as the input feature sequence. This presents a challenge: a 10-second audio clip might produce 1000 feature vectors and thus 1000 output predictions, while the corresponding text transcription might only be 50 characters long. This mismatch in length is a fundamental problem that we will solve with the Connectionist Temporal Classification (CTC) loss function, which is covered later in this chapter.
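The arithmetic behind this mismatch is easy to sketch. Assuming common framing defaults (a 10 ms hop between feature frames) and a hypothetical transcript, the counts look like this:

```python
# Assumed framing: one feature vector every 10 ms (a common default hop size)
clip_duration_s = 10.0
hop_s = 0.010

num_frames = int(clip_duration_s / hop_s)
transcript = "the quick brown fox jumps over the lazy dog"   # hypothetical label

print(num_frames)       # 1000 per-frame predictions from the RNN
print(len(transcript))  # 43 characters to align them against
```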
Implementing a basic RNN layer is straightforward in modern deep learning frameworks. In PyTorch, you can define a simple RNN-based acoustic model like this:
```python
import torch
import torch.nn as nn

class SimpleASR_RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleASR_RNN, self).__init__()
        self.hidden_size = hidden_size
        # The RNN layer processes the sequence of input features
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # A fully connected layer to map hidden states to character probabilities
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize the hidden state for the first time step
        # Shape: (num_layers, batch_size, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)

        # Pass the input sequence and initial hidden state to the RNN
        # out: contains the output hidden state for each time step
        # hidden: contains the final hidden state of the sequence
        out, hidden = self.rnn(x, h0)

        # Pass the RNN's output through the fully connected layer
        # to get predictions for each time step
        out = self.fc(out)
        return out

# Example usage:
# num_features = 40 (e.g., 40 MFCCs)
# rnn_hidden_size = 256
# num_output_classes = 29 (e.g., 26 letters + space + apostrophe + blank token)
# model = SimpleASR_RNN(num_features, rnn_hidden_size, num_output_classes)
```
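To confirm that the model produces one prediction per time step, you can run the class defined above on a random batch (assuming the imports and class from that listing; the batch size, sequence length, and feature count below are illustrative):

```python
# Illustrative shapes: 4 utterances, 1000 frames each, 40 features per frame
model = SimpleASR_RNN(input_size=40, hidden_size=256, num_classes=29)
features = torch.randn(4, 1000, 40)

logits = model(features)
print(logits.shape)   # torch.Size([4, 1000, 29]): one score vector per time step
```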
In the `SimpleASR_RNN` definition:

- `input_size` corresponds to the number of features in each input vector $x_t$ (e.g., 40).
- `hidden_size` is the dimension of the hidden state vector $h_t$. This is a hyperparameter you can tune.
- `num_classes` is the size of your output vocabulary.

While RNNs are a good starting point, simple RNNs struggle with long sequences. The influence of an input from an early time step gradually diminishes as the network processes more of the sequence. This is known as the vanishing gradient problem, where the gradients used to update the network's weights become vanishingly small over long distances, making it difficult for the model to learn long-range dependencies. In speech recognition, this means the model might forget the beginning of a long sentence by the time it reaches the end.
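A quick, empirical way to observe this is to backpropagate from only the final time step of an untrained `nn.RNN` and compare how much gradient reaches the earliest and latest input frames. The sketch below uses assumed sizes matching the earlier example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=40, hidden_size=256, batch_first=True)

# A long sequence of 1000 frames, tracked so we can inspect input gradients
x = torch.randn(1, 1000, 40, requires_grad=True)
out, _ = rnn(x)

# Backpropagate a scalar that depends only on the final time step's output
out[:, -1, :].sum().backward()

# Compare gradient magnitudes at the first and last frames
print(x.grad[0, 0].norm())    # typically vanishingly small
print(x.grad[0, -1].norm())   # much larger by comparison
```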
To overcome this significant limitation, more sophisticated recurrent architectures were developed. The next section will introduce Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which are designed to better capture and maintain context over long sequences.