
enough information about the location of a global minimum that we can find the
global minimum by solving progressively less-blurred versions of the problem. This
approach can break down in three different ways. First, it might successfully define
a series of cost functions where the first is convex and the optimum tracks from
one function to the next, arriving at the global minimum, but it might require so
many incremental cost functions that the cost of the entire procedure remains high.
NP-hard optimization problems remain NP-hard, even when continuation methods
are applicable. The other two ways continuation methods fail both correspond to
the method not being applicable. First, the function might not become convex, no
matter how much it is blurred. Consider, for example, the function J(θ) = −θ⊤θ: blurring it with a Gaussian only shifts it down by a constant, so it remains concave (and unbounded below) at every level of smoothing.
Second, the function may become convex as a result of blurring, but the minimum
of this blurred function may track to a local rather than a global minimum of the
original cost function.
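To make the blurring construction concrete, the sketch below estimates each smoothed cost by Monte Carlo, J^(i)(θ) ≈ average of J(θ + ε) over Gaussian noise ε, and minimizes a sequence of progressively less-blurred versions, warm-starting each stage from the previous solution. The toy cost, the noise schedule, and the finite-difference optimizer are illustrative assumptions, not details from the text.

```python
import numpy as np

def blurred_cost(J, theta, sigma, n_samples=256):
    """Monte Carlo estimate of the Gaussian-blurred cost
    E_{theta' ~ N(theta, sigma^2 I)}[ J(theta') ]."""
    rng = np.random.default_rng(0)           # fixed noise -> smooth finite differences
    noise = sigma * rng.standard_normal((n_samples, theta.size))
    return np.mean([J(theta + eps) for eps in noise])

def continuation_minimize(J, theta0, sigmas=(3.0, 1.0, 0.3, 0.0),
                          steps=200, lr=0.05, fd_eps=1e-2):
    """Minimize a sequence of progressively less-blurred costs,
    warm-starting each stage from the previous solution."""
    theta = np.array(theta0, dtype=float)
    for sigma in sigmas:                     # less and less blurring; 0.0 is the true cost
        Ji = (lambda t, s=sigma: J(t) if s == 0.0 else blurred_cost(J, t, s))
        for _ in range(steps):               # central-difference gradient descent
            grad = np.zeros_like(theta)
            for k in range(theta.size):
                e = np.zeros_like(theta)
                e[k] = fd_eps
                grad[k] = (Ji(theta + e) - Ji(theta - e)) / (2 * fd_eps)
            theta -= lr * grad
    return theta

# Toy nonconvex cost with many local minima; its global minimum is at 0.
J = lambda th: float(np.sum(th**2) + 2.0 * np.sum(np.sin(5.0 * th)**2))
print(continuation_minimize(J, theta0=[2.3, -1.7]))
```

Warm-starting each stage from the previous solution is what lets the heavily blurred costs, which average away the small oscillations, guide the search on the harder, less-blurred ones.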
Though continuation methods were originally designed mostly to deal with the
problem of local minima, local minima are no longer believed to be the primary
problem for neural network optimization. Fortunately, continuation methods can
still help. The easier objective functions introduced by the continuation method can
eliminate flat regions, decrease variance in gradient estimates, improve conditioning
of the Hessian matrix, or do anything else that will either make local updates
easier to compute or improve the correspondence between local update directions
and progress toward a global solution.
Bengio et al. (2009) observed that an approach called curriculum learning, or shaping, can be interpreted as a continuation method. Curriculum learning is
based on the idea of planning a learning process to begin by learning simple concepts
and progress to learning more complex concepts that depend on these simpler
concepts. This basic strategy was previously known to accelerate progress in animal
training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and in machine
learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009)
justified this strategy as a continuation method, where the earlier cost functions J^(i) are made easier by increasing the influence of simpler examples (either by assigning their contributions to the cost function larger coefficients, or by sampling them more frequently, as sketched at the end of this section), and
experimentally demonstrated that better results could be obtained by following a
curriculum on a large-scale neural language modeling task. Curriculum learning
has been successful on a wide range of natural language (Spitkovsky et al., 2010;
Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer
vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013)
tasks. Curriculum learning was also verified as being consistent with the way in which humans teach (Khan et al., 2011): teachers start by showing easier and more prototypical examples and then help the learner refine the decision surface with the less obvious cases.
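As a concrete illustration of the reweighting idea mentioned above, the sketch below produces one possible sequence of per-example weights, concentrated on easy examples in the early stages and uniform in the final stage so that the last cost is the original one. The difficulty scores, the exponential weighting, and the stage schedule are illustrative assumptions, not details from Bengio et al. (2009).

```python
import numpy as np

def curriculum_weights(difficulty, stage, n_stages):
    """Per-example weights for one curriculum stage.

    Early stages concentrate weight on easy examples; by the final
    stage the weighting is uniform, recovering the original cost.
    The exponential form is an illustrative choice."""
    temperature = 1.0 - stage / max(n_stages - 1, 1)   # 1 -> 0 over the stages
    w = np.exp(-5.0 * temperature * difficulty)        # easy examples favored early
    return w / w.sum()

# Hypothetical per-example difficulty scores in [0, 1], e.g. sentence
# length for language modeling or the loss of a simpler model.
difficulty = np.linspace(0.0, 1.0, num=8)

for stage in range(4):
    w = curriculum_weights(difficulty, stage, n_stages=4)
    # Either use w as coefficients on each example's loss, or sample
    # minibatches with probability proportional to w.
    print(f"stage {stage}: weights {np.round(w, 3)}")
```

Each stage's weights define one of the easier intermediate costs J^(i); training proceeds through the stages in order and ends with the original, uniformly weighted objective.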