
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
•
Statistical efficiency: As the number of parameters in a model increases,
so does the amount of training data needed to choose the values of those
parameters using a statistical estimator. Because the table-based model has
an astronomical number of parameters, it will require an astronomically large
training set to fit accurately. Any such model will overfit the training set very
badly unless additional assumptions are made linking the different entries in
the table (as in back-off or smoothed n-gram models; section 12.4.1).
•
Runtime—the cost of inference: Suppose we want to perform an inference
task where we use our model of the joint distribution
P
(
x
) to compute some
other distribution, such as the marginal distribution
P
(
x
1
) or the conditional
distribution
P
(
x
2
| x
1
). Computing these distributions will require summing
across the entire table, so the runtime of these operations is as high as the
intractable memory cost of storing the model.
•
Runtime—the cost of sampling: Likewise, suppose we want to draw a sample
from the model. The naive way to do this is to sample some value
u ∼ U
(0
,
1),
then iterate through the table, adding up the probability values until they
exceed
u
and return the outcome corresponding to that position in the table.
This requires reading through the whole table in the worst case, so it has
the same exponential cost as the other operations.
The problem with the table-based approach is that we are explicitly modeling
every possible kind of interaction between every possible subset of variables. The
probability distributions we encounter in real tasks are much simpler than this.
Usually, most variables influence each other only indirectly.
For example, consider modeling the finishing times of a team in a relay race.
Suppose the team consists of three runners: Alice, Bob and Carol. At the start of
the race, Alice carries a baton and begins running around a track. After completing
her lap around the track, she hands the baton to Bob. Bob then runs his own
lap and hands the baton to Carol, who runs the final lap. We can model each of
their finishing times as a continuous random variable. Alice’s finishing time does
not depend on anyone else’s, since she goes first. Bob’s finishing time depends
on Alice’s, because Bob does not have the opportunity to start his lap until Alice
has completed hers. If Alice finishes faster, Bob will finish faster, all else being
equal. Finally, Carol’s finishing time depends on both her teammates. If Alice is
slow, Bob will probably finish late too. As a consequence, Carol will have quite a
late starting time and thus is likely to have a late finishing time as well. However,
Carol’s finishing time depends only indirectly on Alice’s finishing time via Bob’s.
If we already know Bob’s finishing time, we will not be able to estimate Carol’s
559