When transitioning from Keras to PyTorch, you'll find that the fundamental building blocks of neural networks, the layers, have direct counterparts. While the underlying principles are similar, the naming conventions, parameterization, and some default behaviors can differ. This section walks through common layer types, Dense (Linear), Convolutional (Conv2D), and Recurrent (LSTM), in PyTorch, drawing comparisons to their Keras equivalents.
### tf.keras.layers.Dense vs torch.nn.Linear

The most basic layer, a fully connected or dense layer, performs a linear transformation ($y = xW^T + b$). In Keras, this is `tf.keras.layers.Dense`. In PyTorch, it's `torch.nn.Linear`.
Differences and Similarities:
- Keras uses the `units` parameter to define the dimensionality of the output space; PyTorch uses `out_features`.
- `torch.nn.Linear` requires you to specify `in_features`, the dimensionality of the input. Keras often infers this from the input shape when the model is first called (or if an `input_shape` is provided to the first layer).
- Keras lets you specify an activation function directly in the `Dense` layer (e.g., `activation='relu'`). In PyTorch, activation functions are typically applied as separate modules (e.g., `torch.nn.ReLU()`) or as functions from `torch.nn.functional` after the linear layer.
- Keras uses `use_bias=True` (default) to include a bias term; PyTorch uses `bias=True` (default).

Here's a comparison of common parameters:
| Keras (`tf.keras.layers.Dense`) | PyTorch (`torch.nn.Linear`) | Description |
|---|---|---|
| `units` | `out_features` | Size of the output |
| (inferred or `input_shape`) | `in_features` | Size of the input |
| `activation` | (applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
| `kernel_initializer` | (handled differently) | Weight initialization strategy |
| `bias_initializer` | (handled differently) | Bias initialization strategy |
Example:
Let's create a dense layer that takes 64 input features and produces 128 output features.
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras Dense layer
keras_dense_layer = tf.keras.layers.Dense(units=128, input_shape=(64,), activation='relu')

# Example usage with dummy data
dummy_input_keras = tf.random.normal(shape=(32, 64))  # batch size 32, 64 features
output_keras = keras_dense_layer(dummy_input_keras)
print("Keras Output Shape:", output_keras.shape)  # (32, 128)
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch Linear layer
pytorch_linear_layer = nn.Linear(in_features=64, out_features=128)
pytorch_relu = nn.ReLU()

# Example usage with dummy data
dummy_input_pytorch = torch.randn(32, 64)  # batch size 32, 64 features
linear_output_pytorch = pytorch_linear_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(linear_output_pytorch)  # apply activation separately
print("PyTorch Output Shape:", output_pytorch.shape)  # torch.Size([32, 128])
```
In the PyTorch example, `nn.ReLU()` is instantiated as a module. The weights and biases are initialized automatically in PyTorch, but you can customize this, as discussed in the "Weight Initialization Strategies" section. You could also apply the activation with `torch.nn.functional.relu()`, as sketched below.
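A minimal sketch of the functional form, reusing the shapes from the example above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer as above, with the activation applied via the functional API
linear = nn.Linear(in_features=64, out_features=128)
x = torch.randn(32, 64)   # batch of 32 samples, 64 features each
out = F.relu(linear(x))   # equivalent to nn.ReLU()(linear(x))
print(out.shape)          # torch.Size([32, 128])
```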
### tf.keras.layers.Conv2D vs torch.nn.Conv2d

Convolutional layers are fundamental to computer vision tasks. Keras provides `tf.keras.layers.Conv2D` for 2D convolutions, while PyTorch offers `torch.nn.Conv2d`.
Differences and Similarities:
- Keras uses `filters` to specify the number of output channels (the depth of the convolution); PyTorch uses `out_channels`.
- `torch.nn.Conv2d` requires `in_channels` to be specified. Keras typically infers this.
- Both use `kernel_size`, which can be an int or a tuple for asymmetric kernels.
- Keras uses `strides` (a tuple, e.g., `(1, 1)`); PyTorch uses `stride` (an int or a tuple).
- Keras `padding` accepts `'valid'` (no padding) or `'same'` (padding to maintain input spatial dimensions). PyTorch's `padding` argument can take an integer (symmetric padding), a tuple (padding per spatial dimension), or the string values `'valid'` and `'same'` (similar to Keras, though numerical padding offers more control).
- Keras (`tf.keras.layers.Conv2D`) defaults to the `'channels_last'` data format, meaning input tensors are expected in the shape `(batch_size, height, width, channels)`. PyTorch (`torch.nn.Conv2d`) expects the `'channels_first'` data format: `(batch_size, channels, height, width)`. You need to ensure your input data adheres to this format.

Here's a parameter comparison:
| Keras (`tf.keras.layers.Conv2D`) | PyTorch (`torch.nn.Conv2d`) | Description |
|---|---|---|
| `filters` | `out_channels` | Number of output filters/channels |
| (inferred or `input_shape`) | `in_channels` | Number of input channels |
| `kernel_size` | `kernel_size` | Size of the convolution kernel |
| `strides` | `stride` | Step size of the convolution |
| `padding` | `padding` | Padding mode or amount |
| `data_format` | (implicitly 'channels_first') | Tensor data format |
| `activation` | (applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
Example:
A 2D convolutional layer with 32 output filters, a 3x3 kernel, and stride 1. Assume input images are grayscale (1 channel).
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras Conv2D layer
# Input: (batch, height, width, channels), e.g., (N, 28, 28, 1)
keras_conv_layer = tf.keras.layers.Conv2D(
    filters=32, kernel_size=(3, 3), strides=(1, 1),
    padding='same', activation='relu', input_shape=(28, 28, 1)
)

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 28, 28, 1))  # N, H, W, C
output_keras = keras_conv_layer(dummy_input_keras)
print("Keras Conv2D Output Shape:", output_keras.shape)  # (32, 28, 28, 32) due to 'same' padding
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch Conv2d layer
# Input: (batch, channels, height, width), e.g., (N, 1, 28, 28)
pytorch_conv_layer = nn.Conv2d(
    in_channels=1, out_channels=32, kernel_size=3,
    stride=1, padding=1  # padding=1 for a 3x3 kernel achieves the 'same' effect
)
pytorch_relu = nn.ReLU()

# Example usage
dummy_input_pytorch = torch.randn(32, 1, 28, 28)  # N, C, H, W
conv_output_pytorch = pytorch_conv_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(conv_output_pytorch)
print("PyTorch Conv2d Output Shape:", output_pytorch.shape)  # torch.Size([32, 32, 28, 28])
```
Note on PyTorch `padding` for `Conv2d`: To achieve the 'same' padding behavior as in Keras (where output spatial dimensions match the input for `stride=1`), if your `kernel_size` is $k$, you generally set `padding = (k - 1) // 2` for odd kernel sizes. For `kernel_size=3`, `padding=1`; for `kernel_size=5`, `padding=2`. PyTorch (version 1.9 and later) also accepts the string value `padding='same'`, which simplifies this but requires `stride=1`.
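As a quick sanity check of that rule, the following sketch (layer sizes chosen only for illustration) compares the integer formula against the string form:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # N, C, H, W

# Manual 'same' padding for a 5x5 kernel: (5 - 1) // 2 = 2
conv_manual = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5, stride=1, padding=2)
# String form (PyTorch 1.9+); only valid when stride is 1
conv_string = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5, stride=1, padding='same')

print(conv_manual(x).shape)  # torch.Size([1, 8, 28, 28])
print(conv_string(x).shape)  # torch.Size([1, 8, 28, 28])
```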
### tf.keras.layers.LSTM vs torch.nn.LSTM

For sequence modeling, Long Short-Term Memory (LSTM) networks are a popular choice. Keras provides `tf.keras.layers.LSTM`, and PyTorch offers `torch.nn.LSTM`.
Differences and Similarities:
- Keras uses `units` for the dimensionality of the hidden state (and of the output at each step if `return_sequences=True`); PyTorch uses `hidden_size`.
- `torch.nn.LSTM` requires `input_size`, which is the number of features in the input sequence at each time step.
- Keras expects input in `(batch_size, timesteps, features)` format. `torch.nn.LSTM` defaults to `batch_first=False`, meaning it expects input as `(timesteps, batch_size, features)`. You can set `batch_first=True` to use the more common `(batch_size, timesteps, features)` format. This is a frequent point of attention for developers transitioning.
- Keras's `LSTM` has `return_sequences` (to return the full sequence of outputs) and `return_state` (to return the final hidden and cell states). The `torch.nn.LSTM` forward method always returns `output, (h_n, c_n)`:
  - `output`: contains the output features from the last LSTM layer for each time step. Its shape depends on `batch_first`; if `batch_first=True`, the shape is `(batch_size, seq_len, num_directions * hidden_size)`.
  - `h_n`: contains the final hidden state for each element in the batch. Shape: `(num_layers * num_directions, batch_size, hidden_size)`.
  - `c_n`: contains the final cell state for each element in the batch, with the same shape as `h_n`.
- PyTorch's `num_layers` parameter allows easy stacking of LSTMs. In Keras, you'd stack `LSTM` layers sequentially.

Example:
An LSTM layer with 128 hidden units, processing sequences of length 10 with 20 features per time step.
TensorFlow (Keras):
```python
import tensorflow as tf

# Keras LSTM layer
# Input shape: (batch_size, timesteps, features)
keras_lstm_layer = tf.keras.layers.LSTM(units=128, return_sequences=True, input_shape=(10, 20))

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 10, 20))  # batch, timesteps, features
output_keras = keras_lstm_layer(dummy_input_keras)
print("Keras LSTM Output Shape (sequences):", output_keras.shape)  # (32, 10, 128)

keras_lstm_layer_last_step = tf.keras.layers.LSTM(units=128, return_sequences=False)
output_keras_last = keras_lstm_layer_last_step(dummy_input_keras)
print("Keras LSTM Output Shape (last step):", output_keras_last.shape)  # (32, 128)
```
PyTorch:
```python
import torch
import torch.nn as nn

# PyTorch LSTM layer
# input_size: features per time step; hidden_size: LSTM units
pytorch_lstm_layer = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)

# Example usage
dummy_input_pytorch = torch.randn(32, 10, 20)  # batch, timesteps, features (due to batch_first=True)
output_pytorch, (h_n, c_n) = pytorch_lstm_layer(dummy_input_pytorch)
print("PyTorch LSTM Full Output Shape:", output_pytorch.shape)    # (batch_size, seq_len, hidden_size)
print("PyTorch LSTM Final Hidden State Shape (h_n):", h_n.shape)  # (num_layers, batch_size, hidden_size)
print("PyTorch LSTM Final Cell State Shape (c_n):", c_n.shape)    # (num_layers, batch_size, hidden_size)
```
Important: Remember the `batch_first=True` argument in PyTorch's `nn.LSTM` if your data is structured as `(batch, sequence, feature)`, which is common. Without it, PyTorch expects `(sequence, batch, feature)`. The shapes of `h_n` and `c_n` are `(num_layers * num_directions, batch, hidden_size)`, so for a single-layer, non-bidirectional LSTM this is `(1, batch, hidden_size)`. You might need to `squeeze()` the first dimension if you need just `(batch, hidden_size)`.
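For instance, a minimal sketch of reproducing Keras's `return_sequences=False` output under the `batch_first=True` setup above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)
x = torch.randn(32, 10, 20)      # batch, timesteps, features
output, (h_n, c_n) = lstm(x)

last_step = output[:, -1, :]     # output at the last time step: (32, 128)
last_hidden = h_n.squeeze(0)     # (1, 32, 128) -> (32, 128)
# For a single-layer, unidirectional LSTM these hold the same values
print(torch.allclose(last_step, last_hidden))  # True
```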
Many other layers have straightforward translations:
Pooling Layers:
- Keras: `tf.keras.layers.MaxPool2D`, `tf.keras.layers.AvgPool2D`
- PyTorch: `torch.nn.MaxPool2d`, `torch.nn.AvgPool2d`
- `pool_size` (Keras) maps to `kernel_size` (PyTorch). `strides` and `padding` behave similarly to convolutional layers. Remember the channels-first data format for PyTorch 2D pooling layers (see the sketch below).
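A minimal sketch of the mapping, with the input shape assumed purely for illustration:

```python
import torch
import torch.nn as nn

# Keras equivalent: tf.keras.layers.MaxPool2D(pool_size=(2, 2))
pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size

x = torch.randn(32, 16, 28, 28)     # N, C, H, W (channels-first)
print(pool(x).shape)                # torch.Size([32, 16, 14, 14])
```

Dropout Layers: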
- Keras: `tf.keras.layers.Dropout(rate)`
- PyTorch: `torch.nn.Dropout(p)`
- `rate` in Keras and `p` in PyTorch both represent the probability of an element being zeroed out during training, as sketched below.
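A small sketch of the train/eval difference; note that PyTorch uses inverted dropout, scaling surviving activations by 1/(1-p) during training:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # Keras equivalent: tf.keras.layers.Dropout(rate=0.5)
x = torch.ones(2, 6)

drop.train()              # training mode: elements zeroed with probability p
print(drop(x))            # mix of 0.0 and 2.0 (survivors scaled by 1 / (1 - p))

drop.eval()               # evaluation mode: dropout is a no-op
print(drop(x))            # identical to the input
```

Flatten Layers: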
- Keras: `tf.keras.layers.Flatten()`
- PyTorch: `torch.nn.Flatten(start_dim=1, end_dim=-1)`
- PyTorch's `Flatten` is more flexible; `start_dim=1` (the default) flattens all dimensions except the batch dimension, matching Keras's behavior (see below).
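For example, with an input shape assumed for illustration:

```python
import torch
import torch.nn as nn

flatten = nn.Flatten()     # start_dim=1 by default, like Keras's Flatten
x = torch.randn(32, 16, 7, 7)
print(flatten(x).shape)    # torch.Size([32, 784]) -- 16 * 7 * 7, batch dim kept
```

Batch Normalization: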
- Keras: `tf.keras.layers.BatchNormalization(axis=-1, ...)` (the axis is usually the channels dimension)
- PyTorch: `torch.nn.BatchNorm1d(num_features)`, `torch.nn.BatchNorm2d(num_features)`, `torch.nn.BatchNorm3d(num_features)`
- `num_features` corresponds to the number of channels for `BatchNorm2d` (data of shape `(N, C, H, W)`) or the number of features/channels for `BatchNorm1d` (data of shape `(N, C)` or `(N, C, L)`), as sketched below.
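A minimal sketch of the channel mapping, with shapes assumed for illustration:

```python
import torch
import torch.nn as nn

# Keras: tf.keras.layers.BatchNormalization(axis=-1) on (N, H, W, C) inputs
# PyTorch: num_features must match the channel dimension C of (N, C, H, W) inputs
bn = nn.BatchNorm2d(num_features=16)

x = torch.randn(32, 16, 28, 28)  # N, C, H, W
print(bn(x).shape)               # torch.Size([32, 16, 28, 28]) -- shape unchanged
```

By understanding these mappings and paying attention to details like input shapes and parameter names, you can effectively translate your Keras layer knowledge to build models using `torch.nn`. The next step is to assemble these layers into complete model architectures.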