
backward propagation through the entire network. One way to reduce the cost of
convolutional network training is to use features that are not trained in a supervised
fashion.
There are three basic strategies for obtaining convolution kernels without
supervised training. One is to simply initialize them randomly. Another is to
design them by hand, for example, by setting each kernel to detect edges at a
certain orientation or scale. Finally, one can learn the kernels with an unsupervised
criterion. For example, Coates et al. (2011) apply k-means clustering to small
image patches, then use each learned centroid as a convolution kernel. In Part III
we describe many more unsupervised learning approaches. Learning the features
with an unsupervised criterion allows them to be determined separately from the
classifier layer at the top of the architecture. One can then extract the features for
the entire training set just once, essentially constructing a new training set for the
last layer. Learning the last layer is then typically a convex optimization problem,
assuming the last layer is something like logistic regression or an SVM.
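The following is a minimal sketch of that pipeline in Python, under the assumption that NumPy and scikit-learn are available; the patch size, number of centroids, helper names, and placeholder data are illustrative, not taken from Coates et al. (2011). Random patches are clustered with k-means, the centroids serve as convolution kernels, features are extracted for the whole training set once, and a convex classifier is fit on top.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def extract_patches(images, patch_size=6, n_patches=10000, rng=None):
    """Sample random square patches from a stack of grayscale images of shape (N, H, W)."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, H, W = images.shape
    patches = np.empty((n_patches, patch_size * patch_size))
    for i in range(n_patches):
        n = rng.integers(N)
        r = rng.integers(H - patch_size + 1)
        c = rng.integers(W - patch_size + 1)
        patches[i] = images[n, r:r + patch_size, c:c + patch_size].ravel()
    return patches

def conv_features(image, kernels, patch_size=6):
    """Apply every kernel at every valid location (cross-correlation) and
    average-pool the rectified responses over the whole image."""
    H, W = image.shape
    out = np.zeros(kernels.shape[0])
    for r in range(H - patch_size + 1):
        for c in range(W - patch_size + 1):
            patch = image[r:r + patch_size, c:c + patch_size].ravel()
            out += np.maximum(kernels @ patch, 0.0)   # rectified detector responses
    return out / ((H - patch_size + 1) * (W - patch_size + 1))

# Placeholder data; in practice these are the real training images and labels.
images = np.random.rand(100, 28, 28)
labels = np.random.randint(0, 10, size=100)

# 1. Learn kernels without supervision: k-means centroids of random image patches.
patches = extract_patches(images)
kernels = KMeans(n_clusters=50, n_init=10).fit(patches).cluster_centers_

# 2. Extract features for the entire training set exactly once.
features = np.stack([conv_features(img, kernels) for img in images])

# 3. Training the last layer is now a convex problem (here, multinomial logistic regression).
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
```

Only step 3 involves any optimization over labels; the feature extractor is fixed once the centroids are computed.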
Random filters often work surprisingly well in convolutional networks (Jarrett
et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al.
(2011) showed that layers consisting of convolution followed by pooling naturally
become frequency selective and translation invariant when assigned random weights.
They argue that this provides an inexpensive way to choose the architecture of
a convolutional network: first, evaluate the performance of several convolutional
network architectures by training only the last layer; then take the best of these
architectures and train the entire architecture using a more expensive approach.
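A sketch of this architecture-selection procedure is shown below, assuming random Gaussian kernels and global max pooling; the candidate list, helper names, and placeholder data are illustrative rather than taken from Saxe et al. (2011).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_kernels(n_kernels, kernel_size, rng):
    """Random Gaussian filters, flattened to shape (n_kernels, kernel_size**2)."""
    return rng.standard_normal((n_kernels, kernel_size * kernel_size))

def conv_pool(images, kernels, kernel_size):
    """Valid cross-correlation with fixed random kernels, then global max pooling."""
    N, H, W = images.shape
    feats = np.full((N, kernels.shape[0]), -np.inf)
    for r in range(H - kernel_size + 1):
        for c in range(W - kernel_size + 1):
            patches = images[:, r:r + kernel_size, c:c + kernel_size].reshape(N, -1)
            feats = np.maximum(feats, patches @ kernels.T)   # keep the strongest response
    return feats

rng = np.random.default_rng(0)
# Placeholder data; in practice use the real training and validation sets.
X_tr, y_tr = np.random.rand(200, 28, 28), np.random.randint(0, 10, 200)
X_va, y_va = np.random.rand(100, 28, 28), np.random.randint(0, 10, 100)

# Candidate architectures, here simply (number of kernels, kernel size).
candidates = [(32, 5), (64, 5), (32, 9)]
scores = {}
for n_kernels, ksize in candidates:
    k = random_kernels(n_kernels, ksize, rng)            # weights are never trained
    f_tr = conv_pool(X_tr, k, ksize)
    f_va = conv_pool(X_va, k, ksize)
    last_layer = LogisticRegression(max_iter=1000).fit(f_tr, y_tr)  # train only the last layer
    scores[(n_kernels, ksize)] = last_layer.score(f_va, y_va)

best = max(scores, key=scores.get)
# Only the winning architecture is then trained end to end with full back-propagation.
```

Because only the final classifier is fit for each candidate, many architectures can be screened at a small fraction of the cost of training them all end to end.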
An intermediate approach is to learn the features, but using methods that do
not require full forward and back-propagation at every gradient step. As with
multilayer perceptrons, we can use greedy layer-wise pretraining: train the first layer
in isolation, then extract all features from the first layer only once, then train the
second layer in isolation given those features, and so on. In chapter 8 we described
how to perform supervised greedy layer-wise pretraining, and in part III we extend this
to greedy layer-wise pretraining using an unsupervised criterion at each layer. The
canonical example of greedy layer-wise pretraining of a convolutional model is the
convolutional deep belief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible
with multilayer perceptrons. Instead of training an entire convolutional layer at a
time, we can train a model of a small patch, as Coates et al. (2011) do with k-means.
We can then use the parameters from this patch-based model to define the kernels
of a convolutional layer. This means that it is possible to use unsupervised learning
to train a convolutional network without ever using convolution during the training process.
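The last step amounts to a simple reshaping of the patch model's parameters; the sketch below illustrates this, with shapes and names chosen only for the example.

```python
import numpy as np

# Suppose an unsupervised patch model (e.g. the k-means step above) produced
# `patch_params` of shape (n_kernels, in_channels * kH * kW). Its training only
# ever saw individual patches, so no convolution was performed.
n_kernels, in_channels, kH, kW = 64, 1, 6, 6
patch_params = np.random.randn(n_kernels, in_channels * kH * kW)  # stand-in for learned centroids

# Reshape the flat patch parameters into the weight layout typically expected
# for a convolutional layer: (out_channels, in_channels, kH, kW).
conv_weights = patch_params.reshape(n_kernels, in_channels, kH, kW)

# These weights can now be loaded into a convolutional layer; convolution is used
# only for feature extraction or inference, not during the unsupervised training.
```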