Okay, let's think about what happens when we're done training our model with dropout and want to actually use it to make predictions on new data (inference or test time).
During training, dropout introduces randomness by setting neuron activations to zero with probability p. This means, on average, only a fraction (1−p) of the neurons in a layer contribute to the output passed to the next layer. Consequently, the overall magnitude or scale of the activations flowing forward is reduced compared to running the network without dropout.
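To make this concrete, here is a minimal NumPy sketch of a training-time dropout pass; the function name, random seed, and example activation values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p):
    """Training-time dropout: zero each activation independently with probability p."""
    # The mask is 1 with probability (1 - p) and 0 with probability p.
    mask = (rng.random(activations.shape) >= p).astype(activations.dtype)
    return activations * mask

a = np.array([0.8, 1.5, 0.3, 2.0])
print(dropout_train(a, p=0.5))   # a random subset of the entries is zeroed on each call
```

On average, each output is scaled down by the keep probability (1−p), which is exactly the reduction in magnitude described above.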
At test time, however, we want our model to be deterministic. Running the same input through the network should always produce the same output. Randomly dropping neurons during inference would violate this. Therefore, at test time, we use the entire network – all neurons are active.
But now we have a mismatch: the network was trained with activations that were, on average, smaller in scale due to dropout. If we suddenly use all neurons at full strength during testing, the activations passed to subsequent layers will be significantly larger than what the network experienced during training. This difference in scale can lead to poor performance, as the network isn't calibrated for these larger values.
To address this discrepancy, we need to ensure the expected output of a neuron at test time matches its expected output during training. Let a be the output activation of a neuron. During training, this neuron is active with probability (1−p) and dropped with probability p, so its expected contribution to the next layer is (1−p)×a + p×0 = (1−p)×a.
At test time, the neuron is always active, producing output a. To make the test-time output scale match the expected training-time scale, we simply scale the test-time activation down by the same factor, (1−p):
a_test = a_train × (1−p)

By multiplying the activations of all neurons in the dropout layer by the keep probability (1−p) during inference, we ensure that the input scale to the next layer remains consistent with what was observed, on average, during training.
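A minimal sketch of this test-time pass (NumPy again, with illustrative values; the function name is just a placeholder):

```python
import numpy as np

def dropout_test(activations, p):
    """Test-time pass for standard dropout: every neuron stays active,
    but the outputs are scaled by the keep probability (1 - p)."""
    return activations * (1.0 - p)

a = np.array([0.8, 1.5, 0.3, 2.0])   # illustrative activations
print(dropout_test(a, p=0.5))        # [0.4, 0.75, 0.15, 1.0]
```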
Imagine a small layer with four neurons and a dropout probability p=0.5 (meaning a keep probability of 1−p=0.5).
During training (left), a random subset of neurons is deactivated (shown grayed out), so the expected output magnitude is reduced. At test time (right), all neurons are active, but their outputs are multiplied by the keep probability (1−p) before being passed to the next layer, matching the expected scale seen during training.
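We can check the expectation argument numerically for a four-neuron example like this one. Averaging many random training-time passes should land close to the scaled test-time output; the activation values and sample count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.8, 1.5, 0.3, 2.0])   # four example activations
p = 0.5                              # drop probability; keep probability is 1 - p

# Average many training-time passes: each neuron survives with probability (1 - p).
masks = rng.random((100_000, a.size)) >= p
print((a * masks).mean(axis=0))      # approximately a * (1 - p) = [0.4, 0.75, 0.15, 1.0]

# Test-time pass: all four neurons stay active, scaled by the keep probability.
print(a * (1 - p))                   # exactly [0.4, 0.75, 0.15, 1.0]
```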
This scaling step is essential for dropout to function correctly. Without it, the network's behavior would differ significantly between training and testing phases.
Performing this scaling operation during the test phase is one way to implement dropout correctly. However, it means that the inference code needs to be aware of the dropout probability used during training and perform this extra multiplication step.
A common alternative approach, known as "inverted dropout," performs the scaling during the training phase instead. This makes the test-time pass simpler, as no scaling is required then. We'll explore inverted dropout in the next section, as it's the standard implementation found in most deep learning frameworks. Understanding the test-time scaling requirement, however, provides the fundamental reason why scaling (at either training or test time) is necessary.
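As a rough preview of that contrast, a sketch of the inverted-dropout variant under the same NumPy-style setup as above (the next section covers the details):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout_train(activations, p):
    """Inverted dropout: drop as usual, but scale the survivors by 1 / (1 - p)
    at training time so the expected scale already matches the full network."""
    mask = (rng.random(activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

def inverted_dropout_test(activations, p):
    """Test-time pass is just the identity: no scaling needed."""
    return activations
```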