One of the motivations behind the development of Gated Recurrent Units (GRUs) was to retain the ability to manage long-range dependencies, similar to LSTMs, but with a simpler structure and, consequently, greater computational efficiency. This efficiency stems primarily from the GRU's reduced complexity compared to the LSTM cell. Let's examine the factors contributing to this difference.
The most direct contributor to GRU's efficiency is its lower parameter count compared to an LSTM cell with the same hidden state size. Recall the structures:
Let $d$ be the input feature dimension and $h$ be the hidden state dimension.

For an LSTM cell, the approximate number of parameters (weights and biases) is:

$$\text{Params}_{\text{LSTM}} \approx 4 \times \big(h \times (d + h) + h\big)$$

The '4' comes from the four transformations (input gate, forget gate, output gate, and candidate cell state). Each transformation involves weights mapping the concatenated input $x_t$ (dimension $d$) and previous hidden state $h_{t-1}$ (dimension $h$) to the hidden dimension $h$, plus a bias vector of dimension $h$.

For a GRU cell, the approximate number of parameters is:

$$\text{Params}_{\text{GRU}} \approx 3 \times \big(h \times (d + h) + h\big)$$

The '3' comes from the three transformations (reset gate, update gate, and candidate hidden state). The structure is similar, but with one fewer gate-related transformation than the LSTM.
This means a GRU cell typically has about 25% fewer parameters than an LSTM cell with the same hidden state size.
Figure: Comparison of the approximate number of parameter sets involved in the core transformations within LSTM and GRU cells (four for the LSTM, three for the GRU).
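As a quick check of these formulas, the short Python sketch below computes the approximate counts for hypothetical dimensions ($d = 128$, $h = 256$); the dimensions and function name are illustrative choices only. Framework implementations may report slightly higher counts because they often use two bias vectors per transformation.

```python
def approx_param_count(d, h, num_transforms):
    """Approximate parameters of a gated RNN cell: num_transforms weight
    matrices of shape (h, d + h), each with a bias vector of size h."""
    return num_transforms * (h * (d + h) + h)

d, h = 128, 256  # hypothetical input and hidden sizes

lstm_params = approx_param_count(d, h, num_transforms=4)
gru_params = approx_param_count(d, h, num_transforms=3)

print(f"LSTM: {lstm_params:,} parameters")               # 394,240
print(f"GRU:  {gru_params:,} parameters")                # 295,680
print(f"Reduction: {1 - gru_params / lstm_params:.0%}")  # 25%
```

Because the only difference is the factor of 4 versus 3, the reduction is exactly 25% regardless of the specific dimensions chosen.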
Fewer parameters directly translate to fewer computations required per time step. The core operations in both LSTMs and GRUs involve matrix multiplications (between inputs/hidden states and weight matrices) and element-wise operations (for gate activations and state updates).
Since a GRU performs three main matrix multiplication steps per time step compared to LSTM's four, it requires fewer floating-point operations (FLOPs). This reduction applies to both the forward pass (calculating hidden states) and the backward pass (calculating gradients during training via Backpropagation Through Time).
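To make the per-step cost concrete, here is a minimal NumPy sketch of a single GRU forward step with randomly initialized weights. The weight names and the $[x_t, h_{t-1}]$ concatenation order are our own choices for illustration (conventions differ between references and frameworks); the point is simply that three weight matrices are applied per step, where an LSTM would apply four.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: exactly three matrix multiplications, one each for
    the update gate, reset gate, and candidate hidden state."""
    concat = np.concatenate([x_t, h_prev])            # shape (d + h,)
    z = sigmoid(W_z @ concat + b_z)                   # update gate (matmul 1)
    r = sigmoid(W_r @ concat + b_r)                   # reset gate (matmul 2)
    concat_r = np.concatenate([x_t, r * h_prev])      # reset applied to previous state
    h_cand = np.tanh(W_h @ concat_r + b_h)            # candidate state (matmul 3)
    return (1.0 - z) * h_prev + z * h_cand            # interpolate old and candidate states

# Hypothetical dimensions and random weights, for illustration only
d, h = 128, 256
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.standard_normal((h, d + h)) * 0.01 for _ in range(3))
b_z, b_r, b_h = (np.zeros(h) for _ in range(3))

h_t = gru_step(rng.standard_normal(d), np.zeros(h), W_z, W_r, W_h, b_z, b_r, b_h)
print(h_t.shape)  # (256,)
```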
While the exact speedup depends on the specific hardware, software implementations (like cuDNN optimizations), and model dimensions, GRUs generally execute faster per time step.
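If you want to gauge the difference on your own setup, a rough timing sketch like the one below (assuming PyTorch is installed; the sequence length, batch size, and layer sizes are arbitrary) compares forward-pass times and parameter counts. The measured gap will vary with hardware, batch size, and backend optimizations such as cuDNN's fused kernels.

```python
import time
import torch

d, h, seq_len, batch = 128, 256, 200, 64   # arbitrary sizes for the comparison
x = torch.randn(seq_len, batch, d)         # (seq, batch, features) layout

lstm = torch.nn.LSTM(input_size=d, hidden_size=h)
gru = torch.nn.GRU(input_size=d, hidden_size=h)

def time_forward(module, inputs, repeats=20):
    """Average wall-clock time of a forward pass over several runs."""
    with torch.no_grad():
        module(inputs)                     # warm-up run
        start = time.perf_counter()
        for _ in range(repeats):
            module(inputs)
        return (time.perf_counter() - start) / repeats

print(f"LSTM forward: {time_forward(lstm, x) * 1e3:.2f} ms")
print(f"GRU forward:  {time_forward(gru, x) * 1e3:.2f} ms")
print("LSTM params:", sum(p.numel() for p in lstm.parameters()))
print("GRU params: ", sum(p.numel() for p in gru.parameters()))
```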
It is important to remember that computational efficiency is only one aspect of model selection. While GRUs are often faster and lighter, LSTMs, with their separate cell state and output gate, sometimes offer slightly better performance on tasks that require modeling particularly intricate or long-range dependencies. The choice between GRU and LSTM is therefore often an empirical one, balancing computational resources against predictive performance, as we will discuss further in the next section.