As discussed previously, methods like Bag-of-Words and TF-IDF create representations that lose the inherent order of words in a text. This works well for some tasks, but many NLP problems depend heavily on understanding sequences. Consider sentiment analysis, where "The movie was not good" and "The movie was good, not great" express different sentiments despite sharing nearly the same words, or machine translation, where word order is fundamental to meaning. To handle such sequential information, we need models designed specifically for ordered data. This brings us to Recurrent Neural Networks (RNNs).
The core idea behind an RNN is recurrence: performing the same task for every element of a sequence, where the output for each element depends on the computations from previous elements. Think of it like reading a sentence: your understanding of the current word is informed by the words you've already seen. RNNs mimic this by maintaining an internal state, or memory ($h_t$), that summarizes the information processed so far in the sequence.
Unlike standard feedforward networks, which process fixed-size inputs independently, RNNs have a loop. This loop allows information to persist from one step of the network to the next. At each time step $t$, the RNN takes the input for that step ($x_t$) and the hidden state from the previous step ($h_{t-1}$) to compute the new hidden state ($h_t$). This new state then serves as the memory for processing the next element in the sequence.
A conceptual diagram of an RNN cell showing the input $x_t$, the hidden state $h_t$ computed from $x_t$ and the previous hidden state $h_{t-1}$, and the output $y_t$. The loop indicates the recurrence.
To better understand how an RNN processes a sequence, it's helpful to visualize it "unrolled" or "unfolded" through time. Imagine making a copy of the network for each time step in the sequence, with the hidden state from one time step passed to the next. Importantly, the same set of parameters (weights and biases) is used across all time steps. This parameter sharing makes RNNs efficient and allows them to generalize to sequences of varying lengths.
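To see parameter sharing concretely, here is a minimal sketch using PyTorch's nn.RNN (which applies a tanh recurrence by default). The layer holds a single set of weights and reuses them at every time step, so the same module handles sequences of different lengths; the dimensions (input size 8, hidden size 16) are chosen only for illustration.

```python
import torch
import torch.nn as nn

# One RNN layer: a single set of weights, reused at every time step
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# Two sequences of different lengths share the same parameters
short_seq = torch.randn(1, 3, 8)    # 3 time steps of 8-dimensional inputs
long_seq = torch.randn(1, 10, 8)    # 10 time steps of 8-dimensional inputs

out_short, h_short = rnn(short_seq)  # out_short: (1, 3, 16), h_short: (1, 1, 16)
out_long, h_long = rnn(long_seq)     # out_long: (1, 10, 16), h_long: (1, 1, 16)

print(out_short.shape, h_short.shape)
print(out_long.shape, h_long.shape)
```

The outputs contain a hidden state per time step, while the second return value is the final hidden state after the last step.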
Consider processing a sequence of length $T$: $x_1, x_2, \dots, x_T$. The unrolled network would look like this:
An RNN unrolled over time steps $1, 2, \dots, T$. The blue arrows show the transfer of the hidden state ($h_t$) from one time step to the next. Note that the RNN Cell block represents the same set of weights applied at each step; $h_0$ is the initial hidden state, often set to zeros.
Let's formalize the calculations within a simple RNN cell. At each time step $t$:

1. Calculate the new hidden state ($h_t$): This is typically done using the current input ($x_t$) and the previous hidden state ($h_{t-1}$), with an activation function (commonly tanh or ReLU) applied:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Here, $x_t$ is the input at time step $t$, $h_{t-1}$ is the hidden state carried over from the previous step, $W_{xh}$ and $W_{hh}$ are the input-to-hidden and hidden-to-hidden weight matrices, and $b_h$ is the hidden bias vector.

2. Calculate the output ($y_t$): The output at time step $t$ is often calculated from the hidden state $h_t$. The specific calculation and activation function depend on the task (e.g., softmax for classification):

$$y_t = W_{hy} h_t + b_y$$

Here, $W_{hy}$ is the hidden-to-output weight matrix and $b_y$ is the output bias vector.
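As an illustration of these equations, here is a minimal NumPy sketch of a single RNN cell step applied over a short sequence. The sizes are arbitrary choices for the example, and the weights are randomly initialized rather than learned.

```python
import numpy as np

# Hypothetical dimensions for this sketch
input_size, hidden_size, output_size = 4, 3, 2

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden-to-output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step: compute the new hidden state and the output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Process a toy sequence of length T = 3, reusing the same weights at each step
h = np.zeros(hidden_size)                  # h_0, the initial hidden state
sequence = rng.normal(size=(3, input_size))
for x_t in sequence:
    h, y = rnn_step(x_t, h)

print(h)  # final hidden state after the last time step
```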
The network learns the weight matrices ($W_{hh}$, $W_{xh}$, $W_{hy}$) and bias vectors ($b_h$, $b_y$) during training, typically using a variant of backpropagation called Backpropagation Through Time (BPTT).
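In practice, modern frameworks handle BPTT automatically by differentiating through the unrolled computation. The sketch below, again with arbitrary sizes and PyTorch assumed available, runs a forward pass over a batch of sequences, computes a loss on the final hidden state, and calls backward(), which propagates gradients through every time step.

```python
import torch
import torch.nn as nn

# Tiny RNN plus a linear head; autograd performs BPTT when backward() is called
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(4, 10, 8)            # batch of 4 sequences, 10 steps each
target = torch.randint(0, 2, (4,))   # one class label per sequence

output, h_n = rnn(x)                 # unrolled forward pass over all 10 steps
logits = head(h_n[-1])               # classify from the final hidden state
loss = nn.functional.cross_entropy(logits, target)

loss.backward()                      # gradients flow back through every time step (BPTT)
optimizer.step()
```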
Imagine feeding the sentence "RNNs process sequences" into an RNN, one word (or its embedding) at a time. At $t=1$, the embedding of "RNNs" is combined with the initial state $h_0$ to produce $h_1$; at $t=2$, "process" and $h_1$ produce $h_2$; and at $t=3$, "sequences" and $h_2$ produce $h_3$.
The final hidden state ($h_3$ in this case) or the outputs at each step ($y_1, y_2, y_3$) can be used for various tasks. For example, $h_3$ could be fed into a classifier for sentiment analysis of the whole sentence, or the outputs $y_t$ could represent predictions for the next word at each step.
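Putting this together, the sketch below runs the recurrence over the three word embeddings and feeds the final hidden state $h_3$ into a small softmax classifier, as one might for whole-sentence sentiment classification. The embedding table, parameter values, and sizes are all hypothetical and randomly initialized for illustration; in a real model they would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden_size, num_classes = 5, 4, 2  # hypothetical sizes

# Toy embedding table for the three tokens (in practice, learned embeddings)
vocab = {"RNNs": 0, "process": 1, "sequences": 2}
embeddings = rng.normal(scale=0.1, size=(len(vocab), embed_dim))

# RNN and classifier parameters (randomly initialized for the sketch)
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_c = rng.normal(scale=0.1, size=(num_classes, hidden_size))  # classifier on top of h_3
b_c = np.zeros(num_classes)

# Run the recurrence over the sentence, one embedding per time step
h = np.zeros(hidden_size)  # h_0
for token in ["RNNs", "process", "sequences"]:
    x_t = embeddings[vocab[token]]
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # produces h_1, h_2, h_3 in turn

# Use the final hidden state h_3 as a summary of the whole sentence
logits = W_c @ h + b_c
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the classes
print(probs)
```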
RNNs provide a foundational architecture for modeling sequential data in NLP. Their ability to maintain a state allows them to capture dependencies between elements in a sequence, overcoming a major limitation of simpler models. However, as we will see in the next section, basic RNNs face challenges when dealing with dependencies over long intervals in the sequence.