
CHAPTER 19. APPROXIMATE INFERENCE
this expense by learning to perform approximate inference. Specifically, we can
think of the optimization process as a function f that maps an input v to an
approximate distribution q∗ = arg max_q L(v, q). Once we think of the multistep
iterative optimization process as just being a function, we can approximate it with
a neural network that implements an approximation f̂(v; θ).
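As a minimal sketch of this idea, the code below implements a small feedforward network that maps an observation v directly to the parameters of an approximate posterior in a single forward pass, replacing the inner iterative optimization over q. The architecture, layer sizes, and the choice of outputting only a Gaussian mean are illustrative assumptions, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer inference network f_hat(v; theta).
# It maps v to the mean of a Gaussian q(h | v); a single forward pass
# stands in for the multistep optimization arg max_q L(v, q).
def init_inference_net(n_v, n_h, n_hidden=32):
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_hidden, n_v)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_h, n_hidden)),
        "b2": np.zeros(n_h),
    }

def infer(theta, v):
    # Forward pass: v -> hidden features -> mean of q(h | v).
    hidden = np.tanh(theta["W1"] @ v + theta["b1"])
    return theta["W2"] @ hidden + theta["b2"]
```

In practice θ would be trained jointly with the model, for example by gradient ascent on L(v, q) with q determined by the network's output.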
19.5.1 Wake-Sleep
One of the main difficulties with training a model to infer h from v is that we
do not have a supervised training set with which to train the model. Given a v,
we do not know the appropriate h. The mapping from v to h depends on the
choice of model family, and evolves throughout the learning process as θ changes.
The wake-sleep algorithm (Hinton et al., 1995b; Frey et al., 1996) resolves this
problem by drawing samples of both h and v from the model distribution. For
example, in a directed model, this can be done cheaply by performing ancestral
sampling beginning at h and ending at v. The inference network can then be
trained to perform the reverse mapping: predicting which h caused the present
v. The main drawback to this approach is that we will only be able to train the
inference network on values of v that have high probability under the model. Early
in learning, the model distribution will not resemble the data distribution, so the
inference network will not have an opportunity to learn on samples that resemble
data.
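The sleep phase described above can be sketched concretely. The snippet below assumes a toy two-layer directed model with Bernoulli units (the model family, shapes, and learning rate are all illustrative): it draws (h, v) pairs by ancestral sampling starting at h and ending at v, then trains a logistic recognition network to predict the sampled h from the sampled v.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical directed model: h ~ Bernoulli(sigmoid(b_h)),
# v ~ Bernoulli(sigmoid(W h + b_v)). Shapes are illustrative.
n_h, n_v = 4, 6
b_h = np.zeros(n_h)
W = rng.normal(0.0, 0.5, size=(n_v, n_h))
b_v = np.zeros(n_v)

# Recognition (inference) network parameters: predict h from v.
R = np.zeros((n_h, n_v))
c = np.zeros(n_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dream_sample():
    # Ancestral sampling from the model: begin at h, end at v.
    h = (rng.random(n_h) < sigmoid(b_h)).astype(float)
    v = (rng.random(n_v) < sigmoid(W @ h + b_v)).astype(float)
    return h, v

# Sleep phase: train the recognition network to invert the generative
# mapping, i.e. maximize the likelihood of the sampled h given v.
lr = 0.1
for _ in range(200):
    h, v = dream_sample()
    p = sigmoid(R @ v + c)   # predicted q(h = 1 | v)
    grad = h - p             # gradient of the Bernoulli log-likelihood
    R += lr * np.outer(grad, v)
    c += lr * grad
```

Because every training pair comes from the model's own distribution, this loop exhibits exactly the drawback noted above: early in learning the dreamed v need not resemble real data.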
In section 18.2 we saw that one possible explanation for the role of dream sleep
in human beings and animals is that dreams could provide the negative phase
samples that Monte Carlo training algorithms use to approximate the negative
gradient of the log partition function of undirected models. Another possible
explanation for biological dreaming is that it provides samples from p(h, v)
which can be used to train an inference network to predict h given v. In some
senses, this explanation is more satisfying than the partition function explanation.
Monte Carlo algorithms generally do not perform well if they are run using only
the positive phase of the gradient for several steps then with only the negative
phase of the gradient for several steps. Human beings and animals are usually
awake for several consecutive hours then asleep for several consecutive hours. It is
not readily apparent how this schedule could support Monte Carlo training of an
undirected model. Learning algorithms based on maximizing L can, however, be
run with prolonged periods of improving q and prolonged periods of improving θ.
If the role of biological dreaming is to train networks for predicting q, then this
explains how animals are able to remain awake for several hours (the longer they
are awake, the greater the gap between L and log p(v), but L will remain a lower