CHAPTER 11. PRACTICAL METHODOLOGY
would give the same result. In the case of grid search, the other hyperparameters
would have the same values for these two runs, whereas with random search, they
would usually have different values. Hence if changing between these two values
makes little marginal difference to the validation set error, grid
search will unnecessarily repeat two equivalent experiments while random search
will still give two independent explorations of the other hyperparameters.
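The argument above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: two hyperparameters on the unit interval, nine runs per strategy, and a first hyperparameter that is taken to have no effect on the validation set error. The function names are hypothetical.

```python
import itertools
import random

def grid_points(values_a, values_b):
    """All combinations of the two hyperparameter value lists."""
    return list(itertools.product(values_a, values_b))

def random_points(n, low, high, seed=0):
    """n points, each hyperparameter drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n)]

grid = grid_points([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])  # 9 runs
rand = random_points(9, 0.0, 1.0)                      # 9 runs

# Assume the first hyperparameter does not influence validation error.
# Only the distinct values tried for the second one are then informative.
distinct_grid = len({b for _, b in grid})
distinct_rand = len({b for _, b in rand})
print(distinct_grid, distinct_rand)
```

With the same budget of nine runs, the grid tries only three distinct values of the hyperparameter that matters, whereas random search tries a different value in essentially every run.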
11.4.5 Model-Based Hyperparameter Optimization
The search for good hyperparameters can be cast as an optimization problem.
The decision variables are the hyperparameters. The cost to be optimized is the
validation set error that results from training using these hyperparameters. In
simplified settings where it is feasible to compute the gradient of some differentiable
error measure on the validation set with respect to the hyperparameters, we can
simply follow this gradient (Bengio et al., 1999; Bengio, 2000; Maclaurin et al.,
2015). Unfortunately, in most practical settings, this gradient is unavailable, either
because of its high computation and memory cost, or because of hyperparameters
that have intrinsically nondifferentiable interactions with the validation set error,
as in the case of discrete-valued hyperparameters.
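As a minimal sketch of the simplified setting in which the gradient is available, consider a scalar ridge regression whose training solution has a closed form, so the validation error can be differentiated with respect to the regularization coefficient by the chain rule. The data, the step size, and the function names below are made up for illustration; this is not the method of any of the cited papers.

```python
import math

# Toy data (made up for illustration): scalar linear model y ~ w * x.
train = [(1.0, 1.3), (2.0, 1.9), (3.0, 3.4)]
valid = [(1.5, 1.4), (2.5, 2.4)]

sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)

def fit(lam):
    """Closed-form ridge solution: argmin_w sum (w*x - y)^2 + lam * w^2."""
    return sxy / (sxx + lam)

def valid_error(lam):
    """Sum of squared errors on the validation set after training with lam."""
    w = fit(lam)
    return sum((w * x - y) ** 2 for x, y in valid)

def valid_grad(lam):
    """d(valid_error)/d(lam) by the chain rule: dV/dw * dw/dlam."""
    w = fit(lam)
    dv_dw = sum(2.0 * (w * x - y) * x for x, y in valid)
    dw_dlam = -sxy / (sxx + lam) ** 2
    return dv_dw * dw_dlam

# Gradient descent on log(lam), which keeps lam positive.
log_lam = 0.0
for _ in range(200):
    lam = math.exp(log_lam)
    log_lam -= 0.5 * valid_grad(lam) * lam  # dV/dlog(lam) = lam * dV/dlam
final_lam = math.exp(log_lam)
```

For this toy data the validation-optimal coefficient can be computed by hand and is about 2.06, which the loop approaches. The sketch depends on the training solution being an explicit differentiable function of the hyperparameter, which is the condition that fails in most practical settings.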
To compensate for this lack of a gradient, we can build a model of the validation
set error, then propose new hyperparameter guesses by performing optimization
within this model. Most model-based algorithms for hyperparameter search use a
Bayesian regression model to estimate both the expected value of the validation set
error for each hyperparameter setting and the uncertainty around this expectation.
Optimization thus involves a trade-off between exploration (proposing hyperparameters
for which there is high uncertainty, which may lead to a large improvement but may
also perform poorly) and exploitation (proposing hyperparameters that the model
is confident will perform as well as any hyperparameters it has seen so far, usually
hyperparameters that are very similar to ones it has seen before). Contemporary
approaches to hyperparameter optimization include Spearmint (Snoek et al., 2012),
TPE (Bergstra et al., 2011) and SMAC (Hutter et al., 2011).
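The model-based loop described above can be sketched as follows. This is a toy under stated assumptions, not the algorithm of any of the tools cited: the surrogate is a zero-mean Gaussian process with a fixed RBF kernel, the proposal rule is a lower confidence bound (mean minus kappa standard deviations), and validation_error is a hypothetical stand-in for training a model and measuring its validation set error.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def validation_error(h):
    """Hypothetical stand-in for training with hyperparameter h."""
    return (h - 0.7) ** 2

def propose(xs, ys, candidates, kappa=2.0):
    """Return the candidate with the smallest lower confidence bound."""
    mean_y = sum(ys) / len(ys)
    centered = [y - mean_y for y in ys]  # the GP prior has zero mean

    def lcb(c):
        m, s = gp_posterior(xs, centered, c)
        # Low predicted error favors exploitation; high s favors exploration.
        return m + mean_y - kappa * s

    return min(candidates, key=lcb)

candidates = [i / 50 for i in range(51)]
xs = [0.0, 1.0]                       # two initial evaluations
ys = [validation_error(x) for x in xs]
for _ in range(8):
    x_next = propose(xs, ys, [c for c in candidates if c not in xs])
    xs.append(x_next)
    ys.append(validation_error(x_next))
best_h = xs[ys.index(min(ys))]
```

The parameter kappa controls the trade-off: a larger value favors candidates with high uncertainty, and a value of zero reduces the rule to pure exploitation of the model's predicted mean.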
Currently, we cannot unambiguously recommend Bayesian hyperparameter
optimization as an established tool for achieving better deep learning results or
for obtaining those results with less effort. Bayesian hyperparameter optimization
sometimes performs comparably to human experts, sometimes better, but fails
catastrophically on other problems. It may be worth trying on a particular problem
to see whether it works, but it is not yet sufficiently mature or reliable. That being said,
hyperparameter optimization is an important field of research that, while often
driven primarily by the needs of deep learning, holds the potential to benefit not