When building models with framework APIs like TensorFlow/Keras or PyTorch, you need to configure several common parameters to tailor LSTM or GRU layers to a specific task. Setting these correctly is important for building effective sequence models.
The most fundamental parameter for both LSTM and GRU layers is typically called units (in Keras/TensorFlow) or hidden_size (in PyTorch). This integer value determines the dimensionality of the hidden state and, for LSTMs, also the cell state.
# TensorFlow/Keras Example
import tensorflow as tf
lstm_layer = tf.keras.layers.LSTM(units=64)
gru_layer = tf.keras.layers.GRU(units=32)
# PyTorch Example
# Note: input_size (the number of input features) must also be specified here
import torch
lstm_layer_pytorch = torch.nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
gru_layer_pytorch = torch.nn.GRU(input_size=10, hidden_size=32, batch_first=True)
Think of units as the number of memory cells or neurons within the recurrent layer. A higher number of units allows the layer to potentially capture more complex patterns and dependencies in the sequence data, increasing the model's representational capacity. However, just like in feedforward networks, too many units can lead to overfitting, especially with smaller datasets, and increase computational cost significantly. Finding the right number often involves experimentation and validation.
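To make this concrete, the short sketch below (using arbitrary dimensions: 20 time steps with 8 features) shows how units determines both the layer's output dimensionality and its trainable parameter count.
import tensorflow as tf
# Compare two LSTM layers that differ only in their number of units.
# The input here is arbitrary: a batch of 4 sequences, 20 time steps, 8 features.
x = tf.random.normal((4, 20, 8))
small_lstm = tf.keras.layers.LSTM(units=16)
large_lstm = tf.keras.layers.LSTM(units=128)
print(small_lstm(x).shape)   # (4, 16)  -> output dimensionality equals units
print(large_lstm(x).shape)   # (4, 128)
# More units also means many more trainable weights (greater capacity, higher cost).
print(small_lstm.count_params())   # 4 * 16 * (8 + 16 + 1) = 1600
print(large_lstm.count_params())   # 4 * 128 * (8 + 128 + 1) = 70144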
Recurrent layers like LSTMs and GRUs involve several internal computations, many of which use activation functions. Frameworks typically allow you to configure two main activations:
activation: This function is applied to the cell state update (in LSTMs) and the candidate hidden state calculation. The default and most common choice is the hyperbolic tangent function, tanh. Its output range of (-1, 1) helps regulate the values within the network.
recurrent_activation: This function is applied to the gates (the forget, input, and output gates in an LSTM; the reset and update gates in a GRU). The standard choice is the sigmoid function, often referred to as sigmoid or sometimes hard_sigmoid (a computationally cheaper approximation). The sigmoid function outputs values between 0 and 1, making it ideal for gating mechanisms. A value close to 1 means "let the information pass," while a value close to 0 means "block the information."
While you can technically change these defaults, tanh for the main activation and sigmoid (or hard_sigmoid) for the recurrent (gate) activations are standard practice and generally work well.
# TensorFlow/Keras Example with custom activations (less common)
lstm_layer = tf.keras.layers.LSTM(
units=128,
activation='relu', # Non-standard choice for main activation
recurrent_activation='sigmoid' # Standard choice for gates
)
# PyTorch: Activations are often implicitly tanh/sigmoid within the nn.LSTM/nn.GRU modules.
# Customizing them might require implementing the cell logic manually.
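To see where these two activations act, here is a minimal hand-written sketch of a single GRU step in PyTorch. The weight tensors and their names are purely illustrative; they are not part of the torch.nn.GRU API, and the final blending convention varies slightly between references.
import torch
hidden_size, input_size = 32, 10
x_t = torch.randn(1, input_size)          # input at one time step
h_prev = torch.zeros(1, hidden_size)      # previous hidden state
# Illustrative weight matrices (no biases, for brevity).
W_z = torch.randn(input_size + hidden_size, hidden_size)
W_r = torch.randn(input_size + hidden_size, hidden_size)
W_h = torch.randn(input_size + hidden_size, hidden_size)
concat = torch.cat([x_t, h_prev], dim=1)
z_t = torch.sigmoid(concat @ W_z)   # update gate -> recurrent_activation (sigmoid)
r_t = torch.sigmoid(concat @ W_r)   # reset gate  -> recurrent_activation (sigmoid)
h_cand = torch.tanh(torch.cat([x_t, r_t * h_prev], dim=1) @ W_h)  # candidate state -> activation (tanh)
h_t = (1 - z_t) * h_prev + z_t * h_cand   # blend old state and candidate into the new hidden state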
return_sequences
A critical parameter is return_sequences. It's a boolean flag that determines the shape of the layer's output.
return_sequences=False (Default): The layer only outputs the hidden state from the final time step. If your input has shape (batch_size, time_steps, features), the output will have shape (batch_size, units). This is common when the recurrent layer is the last layer before a final Dense layer for classification or regression on the entire sequence (e.g., sentiment analysis).
return_sequences=True: The layer outputs the hidden state for every time step. The output shape will be (batch_size, time_steps, units). This is necessary when:
- you stack another recurrent layer on top, since it expects a full sequence as input (see the stacking sketch after the examples below), or
- you need a prediction at every time step, for example in sequence labeling or when applying a TimeDistributed Dense layer.
# TensorFlow/Keras Example
# Layer outputs only the last hidden state (shape: batch_size, 64)
lstm_last_state = tf.keras.layers.LSTM(units=64, return_sequences=False)
# Layer outputs hidden states for all time steps (shape: batch_size, time_steps, 128)
lstm_all_states = tf.keras.layers.LSTM(units=128, return_sequences=True)
# PyTorch Example
# By default, nn.LSTM/nn.GRU return outputs for all steps and the final hidden/cell states.
# You often select the part you need from the tuple they return.
lstm_pytorch = torch.nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
# output, (hn, cn) = lstm_pytorch(input_tensor)
# 'output' contains all hidden states (batch, seq_len, hidden_size)
# 'hn' contains the final hidden state (num_layers * num_directions, batch, hidden_size)
Impact of return_sequences on the output shape of an LSTM or GRU layer.
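As a concrete example of the stacking case mentioned above, the sketch below (with arbitrary sizes) shows why every recurrent layer except the last needs return_sequences=True when layers are stacked.
import tensorflow as tf
# A stacked recurrent model: the first LSTM must return the full sequence so the
# second LSTM receives input of shape (batch_size, time_steps, 64).
stacked_model = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 8)),                            # 20 time steps, 8 features
    tf.keras.layers.LSTM(units=64, return_sequences=True),    # outputs (batch, 20, 64)
    tf.keras.layers.LSTM(units=32, return_sequences=False),   # outputs (batch, 32)
    tf.keras.layers.Dense(1)                                  # one prediction per sequence
])
stacked_model.summary()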
return_state
Another boolean parameter, return_state, controls whether the layer returns the final hidden state(s) in addition to the sequence output.
return_state=False (Default): The layer only returns the output sequence (either the last state or all states, depending on return_sequences).
return_state=True: The layer returns a list (in Keras) or tuple (in PyTorch) containing the sequence output and the final state(s). For a GRU, the final state is [final_hidden_state]; for an LSTM, it is [final_hidden_state, final_cell_state].
This is particularly useful in encoder-decoder architectures (covered later) where the final state of the encoder is used to initialize the state of the decoder. It can also be helpful for analyzing the final learned representation of the sequence.
# TensorFlow/Keras Example
lstm_layer = tf.keras.layers.LSTM(units=32, return_sequences=True, return_state=True)
# When called, output is: [all_hidden_states, final_hidden_state, final_cell_state]
gru_layer = tf.keras.layers.GRU(units=64, return_sequences=False, return_state=True)
# When called, output is: [last_hidden_state, final_hidden_state]
# Note: When return_sequences=False, the first two elements are the same.
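As an illustration of the encoder-decoder use case, the sketch below (with made-up dimensions) passes an encoder's final hidden and cell states to a decoder LSTM through Keras's initial_state argument.
import tensorflow as tf
# Encoder: we only keep its final hidden and cell states.
encoder_inputs = tf.keras.Input(shape=(None, 16))             # variable-length source sequence
encoder_lstm = tf.keras.layers.LSTM(units=64, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)
# Decoder: its state is initialized with the encoder's final states.
decoder_inputs = tf.keras.Input(shape=(None, 16))
decoder_lstm = tf.keras.layers.LSTM(units=64, return_sequences=True)
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)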
Input shape: When defining the first recurrent layer in a model (for example, in a Keras Sequential model), you typically need to specify the shape of the input per instance, excluding the batch dimension. This is often done via the input_shape argument, which expects a tuple like (time_steps, features). For subsequent layers, the framework usually infers the input shape automatically. PyTorch requires input_size (the number of features) during layer initialization.
Dropout: Recurrent layers offer two regularization parameters: dropout (applied to the input/output units) and recurrent_dropout (applied to the recurrent connections, i.e., the hidden state updates). These apply dropout masks consistently across time steps for the recurrent connections, which is important for training stability.
Initializers: You can also configure how the weights are initialized (kernel_initializer, recurrent_initializer, bias_initializer). Standard initializers like Glorot (Xavier) uniform or Orthogonal are often good starting points.
Understanding these parameters allows you to effectively construct and customize LSTM and GRU layers for your sequence modeling needs. Remember that the optimal configuration often depends on the specific dataset and task, requiring some degree of experimentation. In the following sections, we'll see how to combine these layers into more complex architectures like stacked and bidirectional RNNs.
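Before moving on, here is a closing sketch (with arbitrary values) that pulls several of these options together on a single Keras LSTM layer: an explicit input shape, both dropout variants, and common weight initializers.
import tensorflow as tf
# An LSTM layer combining several of the configuration options discussed above.
# The specific values (30 time steps, 12 features, 20% dropout) are arbitrary.
configured_lstm = tf.keras.layers.LSTM(
    units=64,
    input_shape=(30, 12),                  # (time_steps, features), first layer only
    dropout=0.2,                           # dropout on the input transformations
    recurrent_dropout=0.2,                 # dropout on the recurrent (hidden state) connections
    kernel_initializer='glorot_uniform',   # input weights
    recurrent_initializer='orthogonal',    # recurrent weights
    bias_initializer='zeros'
)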