different datasets by sampling from the training set with replacement, and then
train model i on dataset i. Dropout aims to approximate this process, but with an
exponentially large number of neural networks. Specifically, to train with dropout,
we use a minibatch-based learning algorithm that makes small steps, such as
stochastic gradient descent. Each time we load an example into a minibatch, we
randomly sample a different binary mask to apply to all the input and hidden
units in the network. The mask for each unit is sampled independently from all
the others. The probability of sampling a mask value of one (causing a unit to be
included) is a hyperparameter fixed before training begins. It is not a function
of the current value of the model parameters or the input example. Typically, an
input unit is included with probability 0.8, and a hidden unit is included with
probability 0.5. We then run forward propagation, back-propagation, and the
learning update as usual. Figure 7.7 illustrates how to run forward propagation
with dropout.
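As a concrete illustration, below is a minimal NumPy sketch of one forward pass with freshly sampled dropout masks. The two-layer architecture, the ReLU nonlinearity, and all names are illustrative assumptions rather than details taken from figure 7.7; only the keep probabilities (0.8 for input units, 0.5 for hidden units) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, b1, W2, b2, p_input=0.8, p_hidden=0.5):
    """One forward pass with freshly sampled binary dropout masks.

    p_input and p_hidden are the probabilities of *keeping* a unit,
    matching the typical values quoted in the text (0.8 and 0.5).
    """
    # Sample a binary mask for the input units, independently per unit.
    mask_in = rng.binomial(1, p_input, size=x.shape)
    x_dropped = x * mask_in

    # Hidden layer with its own independently sampled mask.
    h = np.maximum(0.0, x_dropped @ W1 + b1)      # ReLU hidden units (an assumption)
    mask_h = rng.binomial(1, p_hidden, size=h.shape)
    h_dropped = h * mask_h

    # Output units are left intact; only input and hidden units are masked.
    return h_dropped @ W2 + b2

# Example usage with random parameters for a toy 4-3-2 network.
x  = rng.normal(size=4)
W1 = rng.normal(size=(4, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2)); b2 = np.zeros(2)
print(dropout_forward(x, W1, b1, W2, b2))
```

In a full training step, the same sampled masks would also be used during back-propagation, so that dropped units contribute neither activations nor gradients for that example.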
More formally, suppose that a mask vector $\mu$ specifies which units to include,
and $J(\theta, \mu)$ defines the cost of the model defined by parameters $\theta$ and mask $\mu$.
Then dropout training consists of minimizing $\mathbb{E}_{\mu} J(\theta, \mu)$. The expectation contains
exponentially many terms, but we can obtain an unbiased estimate of its gradient
by sampling values of $\mu$.
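Concretely, because the distribution over masks does not depend on $\theta$, differentiation commutes with the expectation, and a single sampled mask $\hat{\mu}$ already gives an unbiased (if noisy) gradient estimate. The identity below is a sketch of this standard Monte Carlo argument:

```latex
\nabla_{\theta}\, \mathbb{E}_{\mu} J(\theta, \mu)
  = \mathbb{E}_{\mu}\!\left[ \nabla_{\theta} J(\theta, \mu) \right]
  \approx \nabla_{\theta} J(\theta, \hat{\mu}),
  \qquad \hat{\mu} \sim p(\mu).
```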
Dropout training is not quite the same as bagging training. In the case of
bagging, the models are all independent. In the case of dropout, the models
share parameters, with each model inheriting a different subset of parameters
from the parent neural network. This parameter sharing makes it possible to
represent an exponential number of models with a tractable amount of memory.
In the case of bagging, each model is trained to convergence on its respective
training set. In the case of dropout, typically most models are not explicitly
trained at all—usually, the model is large enough that it would be infeasible to
sample all possible subnetworks within the lifetime of the universe. Instead, a tiny
fraction of the possible subnetworks are each trained for a single step, and the
parameter sharing causes the remaining subnetworks to arrive at good settings of
the parameters. These are the only differences. Beyond these, dropout follows the
bagging algorithm. For example, the training set encountered by each subnetwork
is indeed a subset of the original training set sampled with replacement.
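To make the parameter sharing concrete, here is a toy sketch, with a made-up quadratic loss and hypothetical names, in which each sampled subnetwork takes a single SGD step on the same shared parameter vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameter vector; 16 units gives 2**16 possible masks.
theta = rng.normal(size=16)

def loss_grad(theta, x, mask):
    # Gradient of a toy quadratic loss for the subnetwork selected by `mask`:
    # units with mask == 0 contribute nothing, so their parameters receive
    # zero gradient on this step.
    return mask * (mask * theta - x)

for step in range(1000):
    x = rng.normal(size=16)                   # one training example
    mask = rng.binomial(1, 0.5, size=16)      # one of 2**16 subnetworks
    theta -= 0.1 * loss_grad(theta, x, mask)  # a single SGD step on the
                                              # *shared* parameters
```

Each iteration trains a different randomly chosen subnetwork for exactly one step, yet all subnetworks improve because they draw on the same shared parameter vector.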
To make a prediction, a bagged ensemble must accumulate votes from all
its members. We refer to this process as inference in this context. So far, our
description of bagging and dropout has not required that the model be explicitly
probabilistic. Now, we assume that the model's role is to output a probability
distribution. In the case of bagging, each model i produces a probability distribution