Dropout introduces randomness directly into the network's architecture during training. Unlike weight regularization which modifies the loss function, Dropout modifies the network itself on each forward pass.
The core idea is straightforward: for each training example (or mini-batch), temporarily remove units (along with their connections) from the network with a certain probability p. Each unit has an independent chance p of being dropped, meaning its output is set to zero for that specific forward and backward pass.
Imagine a layer in your network with $N$ units. During a training iteration's forward pass, we generate a random binary mask, let's call it $m$, of the same size as the layer's output. Each element $m_i$ in this mask is drawn from a Bernoulli distribution: it is 1 with probability $1-p$ (the "keep probability") and 0 with probability $p$ (the "dropout probability").
$$m_i \sim \text{Bernoulli}(1-p)$$

If a unit's corresponding mask value $m_i$ is 1, the unit operates as usual. If $m_i$ is 0, the unit's output is forced to zero for this specific forward pass. This mask is then applied element-wise to the layer's activation output, say $a$. The result, $a_{\text{dropout}}$, is then passed to the next layer.
$$a_{\text{dropout}} = a \odot m \quad \text{(where } \odot \text{ denotes element-wise multiplication)}$$

Crucially, a different mask $m$ is generated for every training sample or mini-batch presented to the network. This constant "thinning" of the network prevents units from becoming overly specialized or dependent on the presence of specific other units.
A layer with four units undergoing Dropout during two different training steps. The grayed-out units with dashed outlines are randomly set to zero output (dropped) for that specific step. Notice the set of dropped units changes between steps.
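To make this concrete, here is a minimal NumPy sketch of the forward-pass masking described above. The function name `dropout_forward` and the use of NumPy's random generator are illustrative assumptions for this example, not a reference to any particular framework's API.

```python
import numpy as np

def dropout_forward(a, p, rng=np.random.default_rng()):
    """Apply dropout to an activation array `a` with drop probability `p`.

    Returns the thinned activations and the binary mask, which is
    needed again in the backward pass.
    """
    # Each element is kept (mask = 1) with probability 1 - p.
    mask = rng.binomial(1, 1.0 - p, size=a.shape)
    a_dropout = a * mask  # element-wise product a ⊙ m
    return a_dropout, mask

# Example: a layer output of four units, dropped with p = 0.5.
# A fresh mask is drawn on every call, mirroring the per-step thinning above.
a = np.array([0.8, -1.2, 0.3, 2.1])
a_drop, mask = dropout_forward(a, p=0.5)
```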
This process also has a direct effect on the backward pass:
During the backward pass (backpropagation), the gradients are only computed and propagated through the units that were not dropped (where mi=1). The units whose outputs were set to zero do not contribute to the gradient calculation for that particular training step, effectively being temporarily removed from the gradient computation path as well.
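Continuing the hypothetical `dropout_forward` sketch above, the backward step simply reuses the same mask: multiplying the incoming gradient by it zeroes the contribution of every dropped unit, so only the kept units participate in that step's update.

```python
def dropout_backward(grad_out, mask):
    """Route gradients only through units that were kept (mask = 1).

    `grad_out` is the gradient of the loss with respect to the layer's
    dropped-out activations; dropped units receive zero gradient.
    """
    return grad_out * mask
```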
This constant shuffling and disabling of units during training is the key mechanism by which Dropout helps prevent overfitting. However, it also means we need a consistent way to use the full network during inference or testing, which we will address in the next section on activation scaling.