Before the advent of Transformers, Recurrent Neural Networks (RNNs) were the standard architecture for handling sequential data, such as text or time series. Unlike feed-forward networks, which process inputs independently, RNNs possess a form of memory, allowing information from previous steps in the sequence to influence the processing of the current step. This makes them naturally suited for tasks where context and order matter.
The central concept in an RNN is the hidden state, often denoted as $h_t$ for time step $t$. This hidden state acts as a compressed summary of the information seen in the sequence up to that point. At each time step $t$, the RNN takes two inputs: the current input element $x_t$ from the sequence and the hidden state from the previous time step, $h_{t-1}$. It then computes a new hidden state $h_t$ and, optionally, an output $y_t$.
Think of reading a sentence: "The cat sat on the ___". To predict the next word, you need to remember "The cat sat on the". An RNN mimics this by updating its hidden state as it processes each word, carrying forward relevant context.
This process involves a loop: the same set of operations and weights are applied at every time step, using the previous hidden state as input. This shared weight structure makes RNNs parameter-efficient, as they don't need separate parameters for each position in the sequence.
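In code, this loop amounts to repeatedly applying one step function to a running hidden state. The sketch below is purely schematic: rnn_step is a hypothetical placeholder for the computation described next, not a library function.
h = h_0                     # initial hidden state
for x_t in sequence:        # walk the sequence in order
    h = rnn_step(x_t, h)    # same function, same weights at every step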
Let's look at the computations inside a simple RNN cell at time step t:
Calculate the new hidden state $h_t$: This is typically done by combining the current input $x_t$ and the previous hidden state $h_{t-1}$ using weight matrices and an activation function (often the hyperbolic tangent, tanh).
$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
Here, $W_{xh}$ is the input-to-hidden weight matrix, $W_{hh}$ is the hidden-to-hidden weight matrix, and $b_h$ is a bias vector.
Calculate the output $y_t$ (optional): Depending on the task, an output might be generated at each time step based on the current hidden state.
$$y_t = W_{hy} h_t + b_y$$
Here, $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias.
The key is that the weight matrices ($W_{xh}$, $W_{hh}$, $W_{hy}$) and biases ($b_h$, $b_y$) are the same across all time steps. The network learns a single transition function that is applied repeatedly.
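To make these equations concrete, here is a small from-scratch sketch of the recurrence in PyTorch. The sizes, parameter names, and the helper rnn_step are illustrative choices for this sketch, not part of any library API.
import torch

# Illustrative sizes for this sketch
input_size, hidden_size, output_size = 10, 20, 5

# Shared parameters, reused at every time step
W_xh = torch.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
W_hy = torch.randn(output_size, hidden_size) * 0.1  # hidden-to-output weights
b_h = torch.zeros(hidden_size)                      # hidden bias
b_y = torch.zeros(output_size)                      # output bias

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # y_t = W_hy h_t + b_y
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Apply the same step function across a toy sequence of 3 time steps
h = torch.zeros(hidden_size)            # h_0
for t in range(3):
    x_t = torch.randn(input_size)       # stand-in for the t-th input
    h, y_t = rnn_step(x_t, h)
print(h.shape, y_t.shape)               # torch.Size([20]) torch.Size([5])
Because the same W_xh, W_hh, W_hy, b_h, and b_y are used inside every iteration, the parameter count does not grow with sequence length.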
While we often draw an RNN cell with a loop, it's useful to visualize it "unrolled" across the sequence length. This shows how the computation flows from one time step to the next.
An RNN unrolled through three time steps. The same RNN cell (representing shared weights $W_{xh}$, $W_{hh}$, $W_{hy}$) processes input $x_t$ and the previous hidden state $h_{t-1}$ to produce the current hidden state $h_t$ and output $y_t$.
PyTorch provides convenient modules for RNNs. Here's a basic example of defining and using a single-layer RNN:
import torch
import torch.nn as nn
# Define parameters
input_size = 10 # Dimension of input vector x_t
hidden_size = 20 # Dimension of hidden state h_t
sequence_length = 5
batch_size = 3
# Create an RNN layer
# batch_first=True means input/output tensors have batch dim first
# (batch, seq, feature)
rnn_layer = nn.RNN(input_size, hidden_size, batch_first=True)
# Create some dummy input data
# Shape: (batch_size, sequence_length, input_size)
input_sequence = torch.randn(batch_size, sequence_length, input_size)
# Initialize hidden state (optional, defaults to zeros)
# Shape: (num_layers * num_directions, batch_size, hidden_size)
# -> (1, 3, 20) for this case
initial_hidden_state = torch.zeros(1, batch_size, hidden_size)
# Pass the input sequence and initial hidden state through the RNN
# output contains the hidden state for *each* time step
# final_hidden_state contains only the *last* hidden state
output, final_hidden_state = rnn_layer(input_sequence, initial_hidden_state)
print("Input shape:", input_sequence.shape)
# Output shape: (batch_size, sequence_length, hidden_size)
print("Output shape:", output.shape)
# Final hidden state shape: (num_layers * num_directions, batch_size,
# hidden_size)
print("Final hidden state shape:", final_hidden_state.shape)
# Example: Accessing hidden state at the last time step from the output
last_time_step_output = output[:, -1, :]
print("Last time step hidden state from output shape:",
last_time_step_output.shape)
# Verify it matches the final_hidden_state (squeeze the first dimension)
print(
"Are final hidden state and last output step equal?",
torch.allclose(
final_hidden_state.squeeze(0),
last_time_step_output
)
)
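As a cross-check of the unrolled picture, the same sequence can also be processed one step at a time with nn.RNNCell. The loop below is a sketch that reuses the tensors defined in the example above; it will not numerically match the nn.RNN output, because the two modules are initialized with different random weights.
# Manual unrolling with nn.RNNCell: one step per loop iteration
rnn_cell = nn.RNNCell(input_size, hidden_size)

h = torch.zeros(batch_size, hidden_size)   # h_0
step_states = []
for t in range(sequence_length):
    x_t = input_sequence[:, t, :]          # (batch_size, input_size)
    h = rnn_cell(x_t, h)                   # h_t from x_t and h_{t-1}
    step_states.append(h)

# Stack the per-step hidden states: (batch_size, sequence_length, hidden_size)
manual_output = torch.stack(step_states, dim=1)
print("Manually unrolled output shape:", manual_output.shape)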
This simple structure allows RNNs to model sequential dependencies. However, as we will see in the next section, basic RNNs struggle with learning relationships between elements that are far apart in the sequence. This limitation paved the way for more complex architectures like LSTMs and GRUs.