Now that we understand the conceptual flow of information in an RNN – processing sequence elements one at a time while updating a hidden state – let's formalize the computations happening inside a simple RNN cell at each time step t.
Recall that the cell receives two inputs: the current input feature vector $x_t$ from the sequence and the hidden state $h_{t-1}$ computed at the previous time step. Its goal is to compute the new hidden state $h_t$, which captures information from the past up to the current step, and potentially an output $y_t$ relevant for this specific time step.
The core computations are typically defined by two equations:
Calculating the new hidden state ($h_t$):

$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Calculating the output ($y_t$):

$$y_t = g(W_{hy} h_t + b_y)$$

Let's break down each component of these equations:
$x_t$: This is the input vector at the current time step $t$. If you're processing text, this could be an embedding vector for the word at position $t$. If it's time series data, it might be the sensor readings at time $t$. Let's say $x_t$ has dimension $D$ (i.e., $x_t \in \mathbb{R}^D$).
$h_{t-1}$: This is the hidden state vector from the previous time step $t-1$. It serves as the network's memory of the past. We assume the hidden state has size $N$ (i.e., $h_{t-1} \in \mathbb{R}^N$). For the very first time step ($t=0$), $h_{-1}$ is typically initialized, often as a vector of zeros.
$h_t$: This is the newly computed hidden state vector for the current time step $t$. It also has size $N$ (i.e., $h_t \in \mathbb{R}^N$).
$W_{xh}$: This is the weight matrix that transforms the current input $x_t$. It connects the input layer to the hidden layer. Its dimensions must be compatible with matrix multiplication: if $x_t$ is $D$-dimensional and $h_t$ is $N$-dimensional, then $W_{xh}$ must have shape $N \times D$.
$W_{hh}$: This is the weight matrix that transforms the previous hidden state $h_{t-1}$. It represents the recurrent connection from the hidden layer at the previous time step to the hidden layer at the current time step. Its dimensions must connect an $N$-dimensional vector ($h_{t-1}$) to another $N$-dimensional vector ($h_t$), so $W_{hh}$ has shape $N \times N$.
$b_h$: This is the bias vector added to the hidden state calculation. It has the same dimension as the hidden state, $N$ (i.e., $b_h \in \mathbb{R}^N$).
$f$: This is the activation function applied element-wise to compute the hidden state. In simple RNNs, this is often the hyperbolic tangent function ($\tanh$), which squashes values into the range $[-1, 1]$ and helps regulate the activations flowing through the network. Other functions such as ReLU could potentially be used, but $\tanh$ is traditional for basic RNN hidden states.
$y_t$: This is the output vector computed at time step $t$. Its dimension, let's call it $K$ (i.e., $y_t \in \mathbb{R}^K$), depends on the specific task. For instance, in language modeling you might predict the next word, so $K$ would be the vocabulary size. In time series forecasting, $K$ might be 1 if you predict a single value. Note that an output might not be needed at every time step, depending on the application.
$W_{hy}$: This is the weight matrix that transforms the current hidden state $h_t$ to produce the output $y_t$. It connects the hidden layer to the output layer. To transform an $N$-dimensional hidden state into a $K$-dimensional output, $W_{hy}$ must have shape $K \times N$.
$b_y$: This is the bias vector added to the output calculation. It has the same dimension as the output, $K$ (i.e., $b_y \in \mathbb{R}^K$).
$g$: This is the activation function applied element-wise to compute the final output $y_t$. The choice of $g$ depends heavily on the nature of the task. For classification tasks, such as predicting the next word from a vocabulary, $g$ is typically the softmax function, which turns the output scores into a probability distribution over the $K$ classes.

A significant aspect of RNNs is parameter sharing. The same set of weight matrices ($W_{xh}$, $W_{hh}$, $W_{hy}$) and bias vectors ($b_h$, $b_y$) is used to perform the calculations at every single time step. This makes the model computationally efficient and allows it to generalize patterns learned at one point in a sequence to other points, regardless of the sequence length. The network learns a single set of rules for how to update its state and produce an output based on the current input and its memory.
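To make the shapes and the two equations concrete, here is a minimal NumPy sketch of a single forward step of a simple RNN cell. The function name rnn_cell_step, the dimension values, and the choice of softmax for $g$ are illustrative assumptions for this example, not a fixed API.

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One forward step of a simple RNN cell (illustrative sketch).

    x_t:    (D,)  current input vector
    h_prev: (N,)  previous hidden state h_{t-1}
    Returns (h_t, y_t) with shapes (N,) and (K,).
    """
    # h_t = f(W_hh h_{t-1} + W_xh x_t + b_h), with f = tanh
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

    # y_t = g(W_hy h_t + b_y); here we assume g = softmax (classification)
    scores = W_hy @ h_t + b_y
    y_t = np.exp(scores - scores.max())   # subtract max for numerical stability
    y_t /= y_t.sum()
    return h_t, y_t

# Illustrative dimensions: D-dimensional input, N hidden units, K output classes
D, N, K = 3, 5, 4
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(N, D))   # input-to-hidden,  shape N x D
W_hh = rng.normal(scale=0.1, size=(N, N))   # hidden-to-hidden, shape N x N
W_hy = rng.normal(scale=0.1, size=(K, N))   # hidden-to-output, shape K x N
b_h = np.zeros(N)
b_y = np.zeros(K)

x_t = rng.normal(size=D)   # one input vector
h_prev = np.zeros(N)       # initial hidden state, often zeros

h_t, y_t = rnn_cell_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y)
print(h_t.shape, y_t.shape)  # (5,) (4,)
```

Because the same $W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, and $b_y$ are passed in at every call, applying this function repeatedly over a sequence uses exactly one set of parameters, which is the parameter sharing described above.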
Here's a visual representation of the calculations within a single RNN cell:
Computational flow within a simple RNN cell. Inputs $x_t$ and $h_{t-1}$ are transformed by weight matrices $W_{xh}$ and $W_{hh}$ respectively, summed with bias $b_h$, and passed through activation $f$ to produce $h_t$. Then $h_t$ is transformed by $W_{hy}$, summed with bias $b_y$, and passed through activation $g$ to produce $y_t$.
These equations define the forward propagation of information through the RNN cell for a single time step. During training, the network's parameters ($W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$) are adjusted to minimize a loss function based on the predicted outputs $y_t$ compared to the true target values. This learning process involves calculating gradients, which, in the context of RNNs, requires a technique called Backpropagation Through Time (BPTT), the topic we will address next.
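To connect the single-step computation to training, the following sketch continues from the previous snippet (same imports, parameters, and the hypothetical rnn_cell_step) and unrolls the cell over a short toy sequence, accumulating a cross-entropy loss against made-up integer targets. Minimizing a loss like this is what the gradients computed by BPTT will be used for.

```python
# (continues from the previous sketch: numpy imported, parameters and rnn_cell_step defined)
T = 6                                  # sequence length
xs = rng.normal(size=(T, D))           # toy input sequence
targets = rng.integers(0, K, size=T)   # toy class targets, one per time step

h = np.zeros(N)                        # initial hidden state
loss = 0.0
for t in range(T):
    # Same W_xh, W_hh, W_hy, b_h, b_y at every time step (parameter sharing)
    h, y = rnn_cell_step(xs[t], h, W_xh, W_hh, W_hy, b_h, b_y)
    # Cross-entropy at this step: negative log-probability of the true class
    loss += -np.log(y[targets[t]] + 1e-12)

loss /= T
print(f"average per-step loss: {loss:.4f}")
```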