The standard dropout procedure randomly drops units during training and then scales the activations down by the keep probability p at test time. This introduces a slight asymmetry between training and testing: the network's forward pass changes depending on whether you are training or evaluating.
A common and elegant implementation technique called inverted dropout addresses this. Instead of scaling down activations at test time, inverted dropout performs the scaling during training.
Here's how it works: during training, after each unit is randomly dropped with probability 1−p, the activations of the units that are kept are divided by the keep probability p (equivalently, multiplied by 1/p). At test time, no mask is applied and no scaling is needed; the layer simply passes its input through.
Why does this work? By scaling up the activations of the kept neurons during training, we ensure that the expected value of the layer's output remains the same as it would be without dropout. If a neuron with activation a is kept (with probability p), its output is scaled to a/p. If it's dropped (with probability 1−p), its output is 0. The expected output is therefore p×(a/p)+(1−p)×0=a.
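To make the scaling concrete, here is a minimal sketch of inverted dropout written by hand with PyTorch tensors. The function name inverted_dropout and its training flag are illustrative choices for this sketch, not part of any library API.
import torch
def inverted_dropout(x, p_keep=0.8, training=True):
    # Illustrative helper, not a library function.
    if not training or p_keep == 1.0:
        # Test time: pass activations through unchanged.
        return x
    # Sample a binary mask: each unit is kept with probability p_keep.
    mask = (torch.rand_like(x) < p_keep).float()
    # Zero out dropped units and scale kept ones by 1/p_keep,
    # so the expected value of each activation stays equal to x.
    return x * mask / p_keep
a = torch.ones(1, 6)
print(inverted_dropout(a, p_keep=0.8, training=True))   # mix of zeros and values of 1.25
print(inverted_dropout(a, p_keep=0.8, training=False))  # unchanged ones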
The significant advantage of inverted dropout is that the forward pass during test time remains unchanged. You don't need to remember to scale the activations by p. You simply use the network as is, effectively turning off the dropout mechanism (i.e., using a keep probability of 1). This simplifies the code required for inference and deployment significantly.
Most modern deep learning frameworks, including PyTorch and TensorFlow, implement dropout using the inverted dropout technique by default when you use their built-in dropout layers.
In PyTorch, you typically add dropout as a layer in your network definition. The framework handles the inverted scaling automatically during training.
import torch
import torch.nn as nn
# Define the keep probability (1 - drop probability)
p_keep = 0.8 # Equivalent to a dropout probability of 0.2
# Define a dropout layer
dropout_layer = nn.Dropout(p=1.0 - p_keep) # nn.Dropout takes the drop probability
# Example activation tensor (e.g., from a previous layer)
# Batch size of 2, 10 features
activations = torch.randn(2, 10)
# --- During Training ---
# Set the model to training mode
dropout_layer.train()
output_train = dropout_layer(activations)
# Observe the output:
# - Some elements will be zero.
# - Non-zero elements will be scaled up by 1/p_keep (1 / 0.8 = 1.25)
print("Activations:\n", activations)
print("\nOutput during Training (Inverted Dropout):\n", output_train)
# Expected magnitude of non-zero elements is roughly 1.25 * original magnitude
# --- During Testing/Inference ---
# Set the model to evaluation mode
dropout_layer.eval()
output_test = dropout_layer(activations)
# Observe the output:
# - No elements are zeroed out.
# - No scaling is applied. The output equals the input.
print("\nOutput during Testing (Dropout inactive):\n", output_test)
# Check if output_test is the same as activations
print("\nIs test output same as input?", torch.allclose(output_test, activations))
As you can see from the example, calling .train() on the dropout layer (or model.train() on the full network) enables dropout with inverted scaling, and calling .eval() (or model.eval()) disables it, making the layer pass its input through unmodified. This seamless handling is why inverted dropout is the standard implementation: you define the dropout layer once, and the framework manages its behavior based on the model's mode (training or evaluation).
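As a sketch of how this looks inside a full model, the small network below places nn.Dropout between layers; switching the whole model between modes toggles every dropout layer at once. The class name SmallNet, the layer sizes, and the drop probability of 0.2 are arbitrary choices for illustration.
import torch
import torch.nn as nn
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.drop = nn.Dropout(p=0.2)   # drop probability, not keep probability
        self.fc2 = nn.Linear(32, 1)
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.drop(x)                # inverted scaling applied only in training mode
        return self.fc2(x)
model = SmallNet()
x = torch.randn(4, 10)
model.train()   # dropout active: masks units and scales the rest by 1/0.8
out_train = model(x)
model.eval()    # dropout inactive: identity pass-through
out_eval = model(x)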