As discussed previously, methods like Bag-of-Words and TF-IDF create representations that lose the inherent order of words in a text. This works well for some tasks, but many NLP problems depend heavily on understanding sequences. Consider sentiment analysis, where "The movie was not good" and "The movie was good, not great" express different sentiments despite sharing nearly the same words, or machine translation, where word order is fundamental to meaning. To handle such sequential information, we need models designed specifically for ordered data. This brings us to Recurrent Neural Networks (RNNs).
The core idea behind an RNN is recurrence: performing the same task for every element of a sequence, where the output for each element depends on the computations from previous elements. Think of it like reading a sentence: your understanding of the current word is informed by the words you've already seen. RNNs mimic this by maintaining an internal state, or memory ($h_t$), that summarizes the information processed so far in the sequence.
Unlike standard feedforward networks, which process fixed-size inputs independently, RNNs have a loop. This loop allows information to persist from one step of the network to the next. At each time step $t$, the RNN takes the input for that step ($x_t$) and the hidden state from the previous step ($h_{t-1}$) to compute the new hidden state ($h_t$). This new state then serves as the memory for processing the next element in the sequence.
A conceptual diagram of an RNN cell showing the input $x_t$, the hidden state $h_t$ computed from $x_t$ and the previous hidden state $h_{t-1}$, and the output $y_t$. The loop indicates the recurrence.
To better understand how an RNN processes a sequence, it's helpful to visualize it "unrolled" or "unfolded" through time. Imagine making a copy of the network for each time step in the sequence, with the hidden state from one time step passed to the next. Importantly, the same set of parameters (weights and biases) is used across all time steps. This parameter sharing makes RNNs efficient and allows them to generalize to sequences of varying lengths.
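To see parameter sharing concretely, here is a minimal sketch using PyTorch's nn.RNN (which applies a tanh recurrence by default). The layer holds a single set of weights and reuses them at every time step, so the same module handles sequences of different lengths; the dimensions (input size 8, hidden size 16) are chosen only for illustration.

```python
import torch
import torch.nn as nn

# One RNN layer: a single set of weights, reused at every time step
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# Two sequences of different lengths share the same parameters
short_seq = torch.randn(1, 3, 8)    # 3 time steps of 8-dimensional inputs
long_seq = torch.randn(1, 10, 8)    # 10 time steps of 8-dimensional inputs

out_short, h_short = rnn(short_seq)  # out_short: (1, 3, 16), h_short: (1, 1, 16)
out_long, h_long = rnn(long_seq)     # out_long: (1, 10, 16), h_long: (1, 1, 16)

print(out_short.shape, h_short.shape)
print(out_long.shape, h_long.shape)
```

The outputs contain a hidden state per time step, while the second return value is the final hidden state after the last step.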
Consider processing a sequence of length $T$: $x_1, x_2, \dots, x_T$. The unrolled network would look like this:
An RNN unrolled over time steps $1, 2, \dots, T$. The blue arrows show the transfer of the hidden state ($h_t$) from one time step to the next. Note that the RNN Cell block represents the same set of weights applied at each step; $h_0$ is the initial hidden state, often set to zeros.
Let's formalize the calculations within a simple RNN cell. At each time step $t$:

1. Calculate the new hidden state ($h_t$): This is typically done using the current input ($x_t$) and the previous hidden state ($h_{t-1}$), with an activation function (commonly tanh or ReLU) applied:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Here, $x_t$ is the input at time step $t$, $h_{t-1}$ is the hidden state carried over from the previous step, $W_{xh}$ and $W_{hh}$ are the input-to-hidden and hidden-to-hidden weight matrices, and $b_h$ is the hidden bias vector.

2. Calculate the output ($y_t$): The output at time step $t$ is often calculated from the hidden state $h_t$. The specific calculation and activation function depend on the task (e.g., softmax for classification):

$$y_t = W_{hy} h_t + b_y$$

Here, $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias vector.
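As an illustration of these equations, here is a minimal NumPy sketch of a single RNN cell step applied over a short sequence. The sizes are arbitrary choices for the example, and the weights are randomly initialized rather than learned.

```python
import numpy as np

# Hypothetical dimensions for this sketch
input_size, hidden_size, output_size = 4, 3, 2

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step: compute the new hidden state and the output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Process a toy sequence of length T = 3, reusing the same weights at each step
h = np.zeros(hidden_size)                  # h_0, the initial hidden state
sequence = rng.normal(size=(3, input_size))
for x_t in sequence:
    h, y = rnn_step(x_t, h)

print(h)  # final hidden state after the last time step
```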
The network learns the weight matrices ($W_{hh}$, $W_{xh}$, $W_{hy}$) and bias vectors ($b_h$, $b_y$) during training, typically using a variant of backpropagation called Backpropagation Through Time (BPTT).
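In practice, modern frameworks handle BPTT automatically by differentiating through the unrolled computation. The sketch below, again with arbitrary sizes and PyTorch assumed available, runs a forward pass over a batch of sequences, computes a loss on the final hidden state, and calls backward(), which propagates gradients through every time step.

```python
import torch
import torch.nn as nn

# Tiny RNN plus a linear head; autograd performs BPTT when backward() is called
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(4, 10, 8)            # batch of 4 sequences, 10 steps each
target = torch.randint(0, 2, (4,))   # one class label per sequence

output, h_n = rnn(x)                 # unrolled forward pass over all 10 steps
logits = head(h_n[-1])               # classify from the final hidden state
loss = nn.functional.cross_entropy(logits, target)

loss.backward()                      # gradients flow back through every time step (BPTT)
optimizer.step()
```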
Imagine feeding the sentence "RNNs process sequences" into an RNN, one word (or its embedding) at a time. At $t=1$, the embedding of "RNNs" is combined with the initial state $h_0$ to produce $h_1$; at $t=2$, "process" and $h_1$ produce $h_2$; and at $t=3$, "sequences" and $h_2$ produce $h_3$.
The final hidden state ($h_3$ in this case) or the outputs at each step ($y_1, y_2, y_3$) can be used for various tasks. For example, $h_3$ could be fed into a classifier for sentiment analysis of the whole sentence, or the outputs $y_t$ could represent predictions for the next word at each step.
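Putting this together, the sketch below runs the recurrence over the three word embeddings and feeds the final hidden state $h_3$ into a small softmax classifier, as one might for whole-sentence sentiment classification. The embedding table, parameter values, and sizes are all hypothetical and randomly initialized for illustration; in a real model they would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden_size, num_classes = 5, 4, 2  # hypothetical sizes

# Toy embedding table for the three tokens (in practice, learned embeddings)
vocab = {"RNNs": 0, "process": 1, "sequences": 2}
embeddings = rng.normal(scale=0.1, size=(len(vocab), embed_dim))

# RNN and classifier parameters (randomly initialized for the sketch)
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_c = rng.normal(scale=0.1, size=(num_classes, hidden_size))  # classifier on top of h_3
b_c = np.zeros(num_classes)

# Run the recurrence over the sentence, one embedding per time step
h = np.zeros(hidden_size)  # h_0
for token in ["RNNs", "process", "sequences"]:
    x_t = embeddings[vocab[token]]
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # produces h_1, h_2, h_3 in turn

# Use the final hidden state h_3 as a summary of the whole sentence
logits = W_c @ h + b_c
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the classes
print(probs)
```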
RNNs provide a foundational architecture for modeling sequential data in NLP. Their ability to maintain a state allows them to capture dependencies between elements in a sequence, overcoming a major limitation of simpler models. However, as we will see in the next section, basic RNNs face challenges when dealing with dependencies over long intervals in the sequence.