Recurrent Neural Networks (RNNs), including their more sophisticated variants like LSTMs and GRUs, are designed around the principle of processing sequences element by element. At each time step $t$, an RNN takes the input $x_t$ and the hidden state $h_{t-1}$ from the previous time step to compute the current hidden state $h_t$. This process can be represented abstractly as:

$$h_t = f(h_{t-1}, x_t; W)$$

where $f$ represents the recurrent function (e.g., involving matrix multiplications and activation functions like tanh or sigmoid, or the more complex gating logic of LSTMs/GRUs) parameterized by weights $W$. An optional output $y_t$ can also be generated at each time step, often derived from $h_t$.
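To make the abstraction concrete, here is a minimal NumPy sketch of one vanilla RNN step, assuming a tanh activation; the parameter names (`W_hh`, `W_xh`, `b`) and sizes are purely illustrative, not any particular library's API.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One application of the recurrent function f: combine the previous
    hidden state and the current input into the new hidden state."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

# Illustrative sizes: hidden size 4, input size 3.
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4))   # hidden-to-hidden weights
W_xh = rng.normal(size=(4, 3))   # input-to-hidden weights
b = np.zeros(4)

h_prev = np.zeros(4)             # initial hidden state
x_t = rng.normal(size=3)         # input at a single time step
h_t = rnn_step(h_prev, x_t, W_hh, W_xh, b)   # new hidden state, shape (4,)
```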
This formulation reveals the core characteristic of recurrent models: an inherent sequential dependency. The computation of the hidden state $h_t$ at time step $t$ must wait for the completion of the computation of $h_{t-1}$ at time step $t-1$. This dependency forms a chain stretching across the entire sequence length.
An unrolled RNN illustrating the sequential flow of information. The hidden state $h_t$ depends directly on the previous state $h_{t-1}$ and the current input $x_t$.
While the computations within a single time step (e.g., the matrix multiplications inside $f$) can often leverage parallel hardware like GPUs, the computation across time steps cannot be parallelized. You cannot compute $h_t$ and $h_{t+1}$ simultaneously because $h_{t+1}$ requires $h_t$ as input.
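The dependency chain appears directly in code: the only way to obtain every hidden state is a loop in which each iteration consumes the output of the previous one. Continuing the illustrative sketch above:

```python
def run_rnn(x_seq, h_0, W_hh, W_xh, b):
    """Process a sequence strictly in order: iteration t cannot begin
    until iteration t-1 has produced its hidden state."""
    h_prev = h_0
    hidden_states = []
    for x_t in x_seq:                        # one iteration per time step
        h_prev = rnn_step(h_prev, x_t, W_hh, W_xh, b)
        hidden_states.append(h_prev)
    return np.stack(hidden_states)           # shape: (seq_len, hidden_size)

x_seq = rng.normal(size=(10, 3))             # a sequence of 10 input vectors
all_h = run_rnn(x_seq, np.zeros(4), W_hh, W_xh, b)
```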
This sequential constraint has significant performance implications: the time to process a sequence grows linearly with its length, since each step must wait for the previous one, and parallel hardware such as GPUs sits partially idle during both training and inference, especially on long sequences.
This contrasts sharply with architectures like feed-forward networks or convolutional neural networks (CNNs) applied to sequences (e.g., 1D convolutions). In those models, computations for different parts of the input sequence can often be performed independently and in parallel, leading to greater efficiency, particularly on specialized hardware.
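For contrast, the sketch below implements a minimal 1D convolution (scalar features and a single hypothetical filter, for simplicity): each output position reads only a fixed window of the input and never another output, which is what allows all positions to be computed in parallel.

```python
import numpy as np

def conv1d(x_seq, kernel):
    """A minimal 1D convolution over the time axis. Each output position
    depends only on a fixed window of the input, never on other outputs,
    so all positions could be computed at the same time."""
    seq_len, k = len(x_seq), len(kernel)
    return np.array([
        np.dot(x_seq[i:i + k], kernel)       # independent of every other position
        for i in range(seq_len - k + 1)
    ])

rng = np.random.default_rng(0)
signal = rng.normal(size=10)                 # a 1D input sequence of length 10
kernel = np.array([0.25, 0.5, 0.25])         # illustrative filter of width 3
out = conv1d(signal, kernel)                 # shape (8,), every entry independent
```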
While LSTMs and GRUs introduce sophisticated gating mechanisms to better control information flow and address gradient problems (discussed next), they fundamentally adhere to this same sequential processing approach. The calculation of the gates (input, forget, output) and cell state at time $t$ still relies on the hidden state $h_{t-1}$ and cell state $c_{t-1}$ from time $t-1$.
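A sketch of a single LSTM step makes this visible; the stacked parameter layout (`W`, `U`, `b`) is an assumed convention for the example, not a specific library's implementation. Every gate is computed from $h_{t-1}$, and the cell update requires $c_{t-1}$, so the step-by-step chain is unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, U, b):
    """One LSTM step with the gate parameters stacked into W, U, b
    (an illustrative layout). Every gate reads h_prev and the cell
    update reads c_prev, so time step t cannot start before t-1 ends."""
    z = W @ x_t + U @ h_prev + b             # all gates depend on h_prev
    i, f, o, g = np.split(z, 4)              # input, forget, output gates, candidate
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g                 # new cell state needs c_prev
    h_t = o * np.tanh(c_t)                   # new hidden state needs c_t
    return h_t, c_t

# Illustrative sizes: hidden size 4, input size 3.
rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = rng.normal(size=(4 * hidden, inp))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h_t, c_t = lstm_step(np.zeros(hidden), np.zeros(hidden), rng.normal(size=inp), W, U, b)
```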
Therefore, the inability to parallelize computations across the sequence length dimension is a fundamental bottleneck inherent in the core design of recurrent networks. This limitation was a primary motivator for exploring alternative architectures, like the Transformer, capable of processing sequence elements more concurrently.