When transitioning from Keras to PyTorch, you'll find that the fundamental building blocks of neural networks, the layers, have direct counterparts. While the underlying principles are similar, the naming conventions, parameterization, and some default behaviors can differ. This section walks through common layer types, Dense (Linear), Convolutional (Conv2D), and Recurrent (LSTM), in PyTorch, drawing comparisons to their Keras equivalents.
### tf.keras.layers.Dense vs torch.nn.Linear

The most basic layer, a fully connected or dense layer, performs a linear transformation ($y = xW^T + b$). In Keras, this is `tf.keras.layers.Dense`. In PyTorch, it's `torch.nn.Linear`.
Differences and Similarities:
- Keras uses the `units` parameter to define the dimensionality of the output space; PyTorch uses `out_features`.
- `torch.nn.Linear` requires you to specify `in_features`, the dimensionality of the input. Keras often infers this from the input shape when the model is first called (or if an `input_shape` is provided to the first layer).
- Keras lets you specify an activation function directly in the `Dense` layer (e.g., `activation='relu'`). In PyTorch, activation functions are typically applied as separate modules (e.g., `torch.nn.ReLU()`) or as functions from `torch.nn.functional` after the linear layer.
- Keras uses `use_bias=True` (default) to include a bias term; PyTorch uses `bias=True` (default).

Here's a comparison of common parameters:
| Keras (`tf.keras.layers.Dense`) | PyTorch (`torch.nn.Linear`) | Description |
|---|---|---|
| `units` | `out_features` | Size of the output |
| (inferred or `input_shape`) | `in_features` | Size of the input |
| `activation` | (applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
| `kernel_initializer` | (handled differently) | Weight initialization strategy |
| `bias_initializer` | (handled differently) | Bias initialization strategy |
Example:
Let's create a dense layer that takes 64 input features and produces 128 output features.
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras Dense layer
keras_dense_layer = tf.keras.layers.Dense(units=128, input_shape=(64,), activation='relu')

# Example usage with dummy data
dummy_input_keras = tf.random.normal(shape=(32, 64))  # batch size 32, 64 features
output_keras = keras_dense_layer(dummy_input_keras)
print("Keras Output Shape:", output_keras.shape)  # (32, 128)
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch Linear layer
pytorch_linear_layer = nn.Linear(in_features=64, out_features=128)
pytorch_relu = nn.ReLU()

# Example usage with dummy data
dummy_input_pytorch = torch.randn(32, 64)  # batch size 32, 64 features
linear_output_pytorch = pytorch_linear_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(linear_output_pytorch)  # apply activation separately
print("PyTorch Output Shape:", output_pytorch.shape)  # torch.Size([32, 128])
```
In the PyTorch example, `nn.ReLU()` is instantiated as a module. The weights and biases are initialized automatically in PyTorch, but you can customize this, as discussed in the "Weight Initialization Strategies" section. You could also apply the activation with `torch.nn.functional.relu()`, as sketched below.
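A minimal sketch of the functional form, reusing the shapes from the example above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer as above, with the activation applied via the functional API
linear = nn.Linear(in_features=64, out_features=128)
x = torch.randn(32, 64)   # batch of 32 samples, 64 features each
out = F.relu(linear(x))   # equivalent to nn.ReLU()(linear(x))
print(out.shape)          # torch.Size([32, 128])
```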
### tf.keras.layers.Conv2D vs torch.nn.Conv2d

Convolutional layers are fundamental to computer vision tasks. Keras provides `tf.keras.layers.Conv2D` for 2D convolutions, while PyTorch offers `torch.nn.Conv2d`.
Differences and Similarities:
- Keras uses `filters` to specify the number of output channels (the depth of the convolution); PyTorch uses `out_channels`.
- `torch.nn.Conv2d` requires `in_channels` to be specified. Keras typically infers this.
- Both use `kernel_size`, which can be an int or a tuple for asymmetric kernels.
- Keras uses `strides` (a tuple, e.g., `(1, 1)`); PyTorch uses `stride` (an int or a tuple).
- Keras `padding` accepts `'valid'` (no padding) or `'same'` (padding to maintain input spatial dimensions). PyTorch's `padding` argument can take an integer (symmetric padding), a tuple (padding per spatial dimension), or the string values `'valid'` and `'same'` (similar to Keras, though numerical padding offers more control).
- Keras (`tf.keras.layers.Conv2D`) defaults to the `'channels_last'` data format, meaning input tensors are expected in the shape `(batch_size, height, width, channels)`. PyTorch (`torch.nn.Conv2d`) expects the `'channels_first'` data format: `(batch_size, channels, height, width)`. You need to ensure your input data adheres to this format.

Here's a parameter comparison:
| Keras (`tf.keras.layers.Conv2D`) | PyTorch (`torch.nn.Conv2d`) | Description |
|---|---|---|
| `filters` | `out_channels` | Number of output filters/channels |
| (inferred or `input_shape`) | `in_channels` | Number of input channels |
| `kernel_size` | `kernel_size` | Size of the convolution kernel |
| `strides` | `stride` | Step size of the convolution |
| `padding` | `padding` | Padding mode or amount |
| `data_format` | (implicitly 'channels_first') | Tensor data format |
| `activation` | (applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
Example:
A 2D convolutional layer with 32 output filters, a 3x3 kernel, and stride 1. Assume input images are grayscale (1 channel).
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras Conv2D layer
# Input: (batch, height, width, channels), e.g., (N, 28, 28, 1)
keras_conv_layer = tf.keras.layers.Conv2D(
    filters=32, kernel_size=(3, 3), strides=(1, 1),
    padding='same', activation='relu', input_shape=(28, 28, 1)
)

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 28, 28, 1))  # N, H, W, C
output_keras = keras_conv_layer(dummy_input_keras)
print("Keras Conv2D Output Shape:", output_keras.shape)  # (32, 28, 28, 32) due to 'same' padding
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch Conv2d layer
# Input: (batch, channels, height, width), e.g., (N, 1, 28, 28)
pytorch_conv_layer = nn.Conv2d(
    in_channels=1, out_channels=32, kernel_size=3,
    stride=1, padding=1  # padding=1 for a 3x3 kernel achieves the 'same' effect
)
pytorch_relu = nn.ReLU()

# Example usage
dummy_input_pytorch = torch.randn(32, 1, 28, 28)  # N, C, H, W
conv_output_pytorch = pytorch_conv_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(conv_output_pytorch)
print("PyTorch Conv2d Output Shape:", output_pytorch.shape)  # torch.Size([32, 32, 28, 28])
```
Note on PyTorch `padding` for `Conv2d`: To achieve the 'same' padding behavior as in Keras (where output spatial dimensions match the input for `stride=1`), if your `kernel_size` is $k$, you generally set `padding = (k - 1) // 2` for odd kernel sizes. For `kernel_size=3`, `padding=1`; for `kernel_size=5`, `padding=2`. PyTorch (version 1.9 and later) also accepts the string value `padding='same'`, which simplifies this but requires `stride=1`.
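As a quick sanity check of that rule, the following sketch (layer sizes chosen only for illustration) compares the integer formula against the string form:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # N, C, H, W

# Manual 'same' padding for a 5x5 kernel: (5 - 1) // 2 = 2
conv_manual = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5, stride=1, padding=2)
# String form (PyTorch 1.9+); only valid when stride is 1
conv_string = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5, stride=1, padding='same')

print(conv_manual(x).shape)  # torch.Size([1, 8, 28, 28])
print(conv_string(x).shape)  # torch.Size([1, 8, 28, 28])
```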
### tf.keras.layers.LSTM vs torch.nn.LSTM

For sequence modeling, Long Short-Term Memory (LSTM) networks are a popular choice. Keras provides `tf.keras.layers.LSTM`, and PyTorch offers `torch.nn.LSTM`.
Differences and Similarities:
- Keras uses `units` for the dimensionality of the hidden state (and of the output at each step if `return_sequences=True`); PyTorch uses `hidden_size`.
- `torch.nn.LSTM` requires `input_size`, which is the number of features in the input sequence at each time step.
- Keras expects input in `(batch_size, timesteps, features)` format. `torch.nn.LSTM` defaults to `batch_first=False`, meaning it expects input as `(timesteps, batch_size, features)`. You can set `batch_first=True` to use the more common `(batch_size, timesteps, features)` format. This is a frequent point of attention for developers transitioning.
- Keras's `LSTM` has `return_sequences` (to return the full sequence of outputs) and `return_state` (to return the final hidden and cell states). The `torch.nn.LSTM` forward method always returns `output, (h_n, c_n)`:
  - `output`: contains the output features from the last LSTM layer for each time step. Its shape depends on `batch_first`; if `batch_first=True`, the shape is `(batch_size, seq_len, num_directions * hidden_size)`.
  - `h_n`: contains the final hidden state for each element in the batch. Shape: `(num_layers * num_directions, batch_size, hidden_size)`.
  - `c_n`: contains the final cell state for each element in the batch, with the same shape as `h_n`.
- PyTorch's `num_layers` parameter allows easy stacking of LSTMs. In Keras, you'd stack `LSTM` layers sequentially.

Example:
An LSTM layer with 128 hidden units, processing sequences of length 10 with 20 features per time step.
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras LSTM layer
# Input shape: (batch_size, timesteps, features)
keras_lstm_layer = tf.keras.layers.LSTM(units=128, return_sequences=True, input_shape=(10, 20))

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 10, 20))  # batch, timesteps, features
output_keras = keras_lstm_layer(dummy_input_keras)
print("Keras LSTM Output Shape (sequences):", output_keras.shape)  # (32, 10, 128)

keras_lstm_layer_last_step = tf.keras.layers.LSTM(units=128, return_sequences=False)
output_keras_last = keras_lstm_layer_last_step(dummy_input_keras)
print("Keras LSTM Output Shape (last step):", output_keras_last.shape)  # (32, 128)
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch LSTM layer
# input_size: features per time step; hidden_size: LSTM units
pytorch_lstm_layer = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)

# Example usage
dummy_input_pytorch = torch.randn(32, 10, 20)  # batch, timesteps, features (due to batch_first=True)
output_pytorch, (h_n, c_n) = pytorch_lstm_layer(dummy_input_pytorch)
print("PyTorch LSTM Full Output Shape:", output_pytorch.shape)    # (batch_size, seq_len, hidden_size)
print("PyTorch LSTM Final Hidden State Shape (h_n):", h_n.shape)  # (num_layers, batch_size, hidden_size)
print("PyTorch LSTM Final Cell State Shape (c_n):", c_n.shape)    # (num_layers, batch_size, hidden_size)
```
Important: Remember the `batch_first=True` argument in PyTorch's `nn.LSTM` if your data is structured as `(batch, sequence, feature)`, which is common. Without it, PyTorch expects `(sequence, batch, feature)`. The shapes of `h_n` and `c_n` are `(num_layers * num_directions, batch, hidden_size)`, so for a single-layer, non-bidirectional LSTM this is `(1, batch, hidden_size)`. You might need to `squeeze()` the first dimension if you need just `(batch, hidden_size)`.
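For instance, a minimal sketch of reproducing Keras's `return_sequences=False` output under the `batch_first=True` setup above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)
x = torch.randn(32, 10, 20)      # batch, timesteps, features
output, (h_n, c_n) = lstm(x)

last_step = output[:, -1, :]     # output at the last time step: (32, 128)
last_hidden = h_n.squeeze(0)     # (1, 32, 128) -> (32, 128)
# For a single-layer, unidirectional LSTM these hold the same values
print(torch.allclose(last_step, last_hidden))  # True
```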
Many other layers have straightforward translations:
Pooling Layers:
- Keras: `tf.keras.layers.MaxPool2D`, `tf.keras.layers.AvgPool2D`
- PyTorch: `torch.nn.MaxPool2d`, `torch.nn.AvgPool2d`
- `pool_size` (Keras) maps to `kernel_size` (PyTorch). `strides` and `padding` behave similarly to convolutional layers. Remember the channels-first data format for PyTorch 2D pooling layers (see the sketch below).
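A minimal sketch of the mapping, with the input shape assumed purely for illustration:

```python
import torch
import torch.nn as nn

# Keras equivalent: tf.keras.layers.MaxPool2D(pool_size=(2, 2))
pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size

x = torch.randn(32, 16, 28, 28)     # N, C, H, W (channels-first)
print(pool(x).shape)                # torch.Size([32, 16, 14, 14])
```

Dropout Layers: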
- Keras: `tf.keras.layers.Dropout(rate)`
- PyTorch: `torch.nn.Dropout(p)`
- `rate` in Keras and `p` in PyTorch both represent the probability of an element being zeroed out during training, as sketched below.
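A small sketch of the train/eval difference; note that PyTorch uses inverted dropout, scaling surviving activations by 1/(1-p) during training:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # Keras equivalent: tf.keras.layers.Dropout(rate=0.5)
x = torch.ones(2, 6)

drop.train()              # training mode: elements zeroed with probability p
print(drop(x))            # mix of 0.0 and 2.0 (survivors scaled by 1 / (1 - p))

drop.eval()               # evaluation mode: dropout is a no-op
print(drop(x))            # identical to the input
```

Flatten Layers: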
- Keras: `tf.keras.layers.Flatten()`
- PyTorch: `torch.nn.Flatten(start_dim=1, end_dim=-1)`
- PyTorch's `Flatten` is more flexible; `start_dim=1` (the default) flattens all dimensions except the batch dimension, matching Keras's behavior (see below).
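For example, with an input shape assumed for illustration:

```python
import torch
import torch.nn as nn

flatten = nn.Flatten()     # start_dim=1 by default, like Keras's Flatten
x = torch.randn(32, 16, 7, 7)
print(flatten(x).shape)    # torch.Size([32, 784]) -- 16 * 7 * 7, batch dim kept
```

Batch Normalization: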
- Keras: `tf.keras.layers.BatchNormalization(axis=-1, ...)` (the axis is usually the channels dimension)
- PyTorch: `torch.nn.BatchNorm1d(num_features)`, `torch.nn.BatchNorm2d(num_features)`, `torch.nn.BatchNorm3d(num_features)`
- `num_features` corresponds to the number of channels for `BatchNorm2d` (data of shape `(N, C, H, W)`) or the number of features/channels for `BatchNorm1d` (data of shape `(N, C)` or `(N, C, L)`), as sketched below.
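A minimal sketch of the channel mapping, with shapes assumed for illustration:

```python
import torch
import torch.nn as nn

# Keras: tf.keras.layers.BatchNormalization(axis=-1) on (N, H, W, C) inputs
# PyTorch: num_features must match the channel dimension C of (N, C, H, W) inputs
bn = nn.BatchNorm2d(num_features=16)

x = torch.randn(32, 16, 28, 28)  # N, C, H, W
print(bn(x).shape)               # torch.Size([32, 16, 28, 28]) -- shape unchanged
```

By understanding these mappings and paying attention to details like input shapes and parameter names, you can effectively translate your Keras layer knowledge to build models using `torch.nn`. The next step is to assemble these layers into complete model architectures.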