
many hyperparameters, whose effect may be measured after the fact but is often
difficult to predict ahead of time. When we perform unsupervised and supervised
learning simultaneously, instead of using the pretraining strategy, there is a single
hyperparameter, usually a coefficient attached to the unsupervised cost, that
determines how strongly the unsupervised objective will regularize the supervised
model. One can always predictably obtain less regularization by decreasing this
coefficient. In unsupervised pretraining, there is no way to flexibly adapt
the strength of the regularization: either the supervised model is initialized to
pretrained parameters, or it is not.
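To make the contrast concrete, the following is a minimal sketch of the joint training strategy, assuming PyTorch and hypothetical layer sizes: a shared encoder feeds both a supervised classifier head and an unsupervised reconstruction head, and a single coefficient lam scales the unsupervised cost. It illustrates the idea rather than any particular published implementation.

```python
import torch
import torch.nn as nn

# Shared encoder with a supervised head and an unsupervised (autoencoder) head.
# Layer sizes are hypothetical placeholders.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
classifier = nn.Linear(256, 10)   # supervised head
decoder = nn.Linear(256, 784)     # unsupervised reconstruction head

supervised_loss = nn.CrossEntropyLoss()
reconstruction_loss = nn.MSELoss()
lam = 0.1  # the single hyperparameter scaling the unsupervised cost

params = (list(encoder.parameters()) + list(classifier.parameters())
          + list(decoder.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)

def training_step(x, y):
    h = encoder(x)
    # Total cost: supervised term plus lam times the unsupervised term.
    # Shrinking lam predictably weakens the unsupervised regularization.
    loss = supervised_loss(classifier(h), y) + lam * reconstruction_loss(decoder(h), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data: a batch of 32 flattened 28x28 inputs.
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
training_step(x, y)
```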
Another disadvantage of having two separate training phases is that each phase
has its own hyperparameters. The performance of the second phase usually cannot
be predicted during the first phase, so there is a long delay between proposing
hyperparameters for the first phase and being able to update them using feedback
from the second phase. The most principled approach is to use validation set error
in the supervised phase to select the hyperparameters of the pretraining phase, as
discussed in Larochelle et al. (2009). In practice, some hyperparameters, like the
number of pretraining iterations, are more conveniently set during the pretraining
phase, using early stopping on the unsupervised objective, which is not ideal but
is computationally much cheaper than using the supervised objective.
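The two-phase procedure can be sketched in the same hypothetical setting (PyTorch, made-up shapes, batches supplied as lists of tensors): the number of pretraining iterations is chosen by early stopping on the unsupervised validation objective, and the supervised phase then fine-tunes the pretrained encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-phase setup: autoencoder pretraining, then supervised fine-tuning.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Linear(256, 784)

def pretrain(unlabeled_train, unlabeled_valid, max_epochs=100, patience=5):
    """Phase 1: set the number of pretraining iterations by early stopping
    on the unsupervised (reconstruction) objective of a validation set."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)
    best, bad = float("inf"), 0
    for epoch in range(max_epochs):
        for x in unlabeled_train:                     # batches of unlabeled inputs
            loss = F.mse_loss(decoder(encoder(x)), x)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            val = sum(F.mse_loss(decoder(encoder(x)), x).item()
                      for x in unlabeled_valid) / len(unlabeled_valid)
        if val < best:
            best, bad = val, 0
        else:
            bad += 1
            if bad >= patience:   # early stopping on the unsupervised objective
                break

def finetune(labeled_train, epochs=20):
    """Phase 2: supervised training initialized from the pretrained encoder."""
    classifier = nn.Linear(256, 10)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(classifier.parameters()), lr=0.01)
    for _ in range(epochs):
        for x, y in labeled_train:                    # batches of (input, label) pairs
            loss = F.cross_entropy(classifier(encoder(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return classifier
```

Note that the hyperparameters of the supervised phase (here, epochs and the learning rate) cannot be evaluated until pretraining has finished, which is the source of the feedback delay described above.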
Today, unsupervised pretraining has been largely abandoned, except in the
field of natural language processing, where the natural representation of words as
one-hot vectors conveys no similarity information and where very large unlabeled
datasets are available. In that case, the advantage of pretraining is that one can pretrain
once on a huge unlabeled set (for example with a corpus containing billions of
words), learn a good representation (typically of words, but also of sentences), and
then use this representation or fine-tune it for a supervised task for which the
training set contains substantially fewer examples. This approach was pioneered
by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al. (2011a)
and remains in common use today.
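As a rough illustration of this use case (PyTorch, a made-up vocabulary size, and random vectors standing in for embeddings pretrained once on a large unlabeled corpus), the sketch below initializes an embedding layer from pretrained word vectors and fine-tunes it for a small supervised classification task.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; random vectors stand in for pretrained word embeddings
# (in practice these would be loaded from disk after pretraining on a huge corpus).
vocab_size, embed_dim, num_classes = 50000, 300, 5
pretrained_vectors = torch.randn(vocab_size, embed_dim)

# freeze=False so the supervised task can fine-tune the pretrained representation.
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
classifier = nn.Linear(embed_dim, num_classes)

def predict(token_ids):
    # Average the word vectors in each sentence and classify. Unlike one-hot
    # vectors, the pretrained vectors already encode similarity between words.
    return classifier(embedding(token_ids).mean(dim=1))

# Example: a batch of 8 sentences, each represented by 20 token ids.
logits = predict(torch.randint(0, vocab_size, (8, 20)))
```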
Deep learning techniques based on supervised learning, regularized with dropout
or batch normalization, are able to achieve human-level performance on many tasks,
but only with extremely large labeled datasets. These same techniques outperform
unsupervised pretraining on medium-sized datasets such as CIFAR-10 and MNIST,
which have roughly 5,000 labeled examples per class. On extremely small datasets,
such as the alternative splicing dataset, Bayesian methods outperform methods
based on unsupervised pretraining (Srivastava, 2013). For these reasons, the
popularity of unsupervised pretraining has declined. Nevertheless, unsupervised
pretraining remains an important milestone in the history of deep learning research