Recurrent Neural Networks process sequence elements one at a time, updating a hidden state. We will now formalize the computations happening inside a simple RNN cell at each time step $t$.
Recall that the cell receives two inputs: the current input feature vector $x_t$ from the sequence and the hidden state $h_{t-1}$ computed at the previous time step. Its goal is to compute the new hidden state $h_t$, which captures information from the past up to the current step, and potentially an output $y_t$ relevant for this specific time step.
The core computations are typically defined by two equations:
Calculating the new hidden state ($h_t$):

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Calculating the output ($y_t$):

$$y_t = g(W_{hy} h_t + b_y)$$
Let's break down each component of these equations:
$x_t$: This is the input vector at the current time step $t$. If you're processing text, this could be an embedding vector for the word at position $t$. If it's time series data, it might be the sensor readings at time $t$. Let's say $x_t$ has dimension $d_x$ (i.e., $x_t \in \mathbb{R}^{d_x}$).
$h_{t-1}$: This is the hidden state vector from the previous time step $t-1$. It serves as the network's memory of the past. We assume the hidden state has size $d_h$ (i.e., $h_{t-1} \in \mathbb{R}^{d_h}$). For the very first time step ($t=1$), $h_0$ is typically initialized, often as a vector of zeros.
$h_t$: This is the newly computed hidden state vector for the current time step $t$. It also has size $d_h$ (i.e., $h_t \in \mathbb{R}^{d_h}$).
$W_{xh}$: This is the weight matrix that transforms the current input $x_t$. It connects the input layer to the hidden layer. Its dimensions must be compatible with matrix multiplication: if $x_t$ is $d_x$-dimensional and $h_t$ is $d_h$-dimensional, then $W_{xh}$ must have shape $(d_h, d_x)$.
$W_{hh}$: This is the weight matrix that transforms the previous hidden state $h_{t-1}$. It represents the recurrent connection from the hidden layer at the previous time step to the hidden layer at the current time step. Its dimensions must connect a $d_h$-dimensional vector ($h_{t-1}$) to another $d_h$-dimensional vector ($h_t$), so $W_{hh}$ has shape $(d_h, d_h)$.
$b_h$: This is the bias vector added to the hidden state calculation. It has the same dimension as the hidden state, $d_h$ (i.e., $b_h \in \mathbb{R}^{d_h}$).
$f$: This is the activation function applied element-wise to compute the hidden state. In simple RNNs, this is often the hyperbolic tangent function (tanh). The tanh function squashes values into the range $[-1, 1]$, which helps regulate the activations flowing through the network. Other functions such as ReLU could be used, but tanh is traditional for basic RNN hidden states.
$y_t$: This is the output vector computed at time step $t$. Its dimension, let's call it $d_y$ (i.e., $y_t \in \mathbb{R}^{d_y}$), depends on the specific task. For instance, in language modeling you might predict the next word, so $d_y$ would be the vocabulary size. In time series forecasting, $d_y$ might be 1 if you predict a single value. Note that an output might not be needed at every time step, depending on the application.
$W_{hy}$: This is the weight matrix that transforms the current hidden state $h_t$ to produce the output $y_t$. It connects the hidden layer to the output layer. To transform a $d_h$-dimensional hidden state into a $d_y$-dimensional output, $W_{hy}$ must have shape $(d_y, d_h)$.
$b_y$: This is the bias vector added to the output calculation. It has the same dimension as the output, $d_y$ (i.e., $b_y \in \mathbb{R}^{d_y}$).
$g$: This is the activation function applied to compute the final output $y_t$. The choice of $g$ depends heavily on the nature of the task; for classification-style outputs, such as predicting the next word from a vocabulary, it is commonly the softmax function.
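To make these shapes concrete, here is a minimal NumPy sketch of a single forward step, assuming a tanh hidden activation and a softmax output activation. The dimensions and random parameter values are illustrative placeholders, not taken from any particular model.

```python
import numpy as np

# Illustrative dimensions (hypothetical values chosen for the example)
d_x, d_h, d_y = 4, 8, 3

rng = np.random.default_rng(0)

# Parameters, with the shapes described above
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden -> hidden (recurrent)
b_h  = np.zeros(d_h)
W_hy = rng.normal(scale=0.1, size=(d_y, d_h))   # hidden -> output
b_y  = np.zeros(d_y)

def rnn_cell_step(x_t, h_prev):
    """One forward step of a simple RNN cell.

    h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = softmax(W_hy @ h_t + b_y)
    """
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    logits = W_hy @ h_t + b_y
    y_t = np.exp(logits - logits.max())   # numerically stable softmax
    y_t /= y_t.sum()
    return h_t, y_t

x_t = rng.normal(size=d_x)     # current input vector
h_prev = np.zeros(d_h)         # initial hidden state h_0
h_t, y_t = rnn_cell_step(x_t, h_prev)
print(h_t.shape, y_t.shape)    # (8,) (3,)
```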
A significant aspect of RNNs is parameter sharing. The same set of weight matrices ($W_{xh}$, $W_{hh}$, $W_{hy}$) and bias vectors ($b_h$, $b_y$) is used to perform the calculations at every single time step. This keeps the number of parameters independent of the sequence length and allows the model to generalize patterns learned at one point in a sequence to other points. The network learns a single set of rules for how to update its state and produce an output based on the current input and its memory.
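Parameter sharing is easy to see in code: unrolling the cell over a sequence is just a loop that reuses the same parameters at every step. This sketch builds on the `rnn_cell_step` function and the parameters defined in the snippet above; the toy sequence is again an illustrative placeholder.

```python
def rnn_forward(xs, h0):
    """Unroll the simple RNN over a whole sequence.

    The same parameters (W_xh, W_hh, W_hy, b_h, b_y) are reused at every
    time step; only the hidden state changes as the loop advances.
    """
    h = h0
    hidden_states, outputs = [], []
    for x_t in xs:                       # xs has shape (T, d_x)
        h, y = rnn_cell_step(x_t, h)     # same weights applied at each step
        hidden_states.append(h)
        outputs.append(y)
    return hidden_states, outputs

sequence = rng.normal(size=(5, d_x))     # toy sequence of length T = 5
hs, ys = rnn_forward(sequence, np.zeros(d_h))
print(len(hs), len(ys))                  # 5 5
```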
Here's a visual representation of the calculations within a single RNN cell:
Computational flow within a simple RNN cell. Inputs $x_t$ and $h_{t-1}$ are transformed by weight matrices $W_{xh}$ and $W_{hh}$ respectively, summed with bias $b_h$, and passed through activation $f$ to produce $h_t$. Then $h_t$ is transformed by $W_{hy}$, summed with bias $b_y$, and passed through activation $g$ to produce $y_t$.
These equations define the forward propagation of information through the RNN cell for a single time step. During training, the network's parameters ($W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$) are adjusted to minimize a loss function based on the predicted outputs compared to the true target values. This learning process involves calculating gradients, which, in the context of RNNs, requires a technique called Backpropagation Through Time (BPTT), the topic we will address next.