
CHAPTER 13. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
an astronomical number of parameters, it will require an astronomically
large training set to fit accurately. Any such model will overfit the training
set very badly.
• Runtime: the cost of inference: Suppose we want to perform an infer-
ence task where we use our model of the joint distribution P (x) to compute
some other distribution, such as the marignal distribution P (x
1
) or the con-
ditional distribution P (x
2
| x
1
). Computing these distributions will require
summing across the entire table, so the runtime of these operations is as
high as the intractable memory cost of storing the model.
• Runtime: the cost of sampling: Likewise, suppose we want to draw a
sample from the model. The naive way to do this is to sample some value
u ∼ U(0, 1), then iterate through the table adding up the probability values
until they exceed u and return the outcome whose probability value was
added last. This requires reading through the whole table in the worst case,
so it has the same exponential cost as the other operations.
The problem with the table-based approach is that we are explicitly modeling
every possible kind of interaction between every possible subset of variables. The
probability distributions we encounter in real tasks are much simpler than this.
Usually, most variables influence each other only indirectly.
For example, consider modeling the finishing times of a team in a relay race.
Suppose the team consists of three runners, Alice, Bob, and Carol. At the start
of the race, Alice carries a baton and begins running around a track. After
completing her lap around the track, she hands the baton to Bob. Bob then runs
his own lap and hands the baton to Carol, who runs the final lap. We can model
each of their finishing times as a continuous random variable. Alice’s finishing
time does not depend on anyone else’s, since she goes first. Bob’s finishing time
depends on Alice’s, because Bob does not have the opportunity to start his lap
until Alice has completed hers. If Alice finishes faster, Bob will finish faster, all
else being equal. Finally, Carol’s finishing time depends on both her teammates.
If Alice is slow, Bob will probably finish late too, and Carol will have quite a late
starting time and thus is likely to have a late finishing time as well. However,
Carol’s finishing time depends only indirectly on Alice’s finishing time via Bob’s.
If we already know Bob’s finishing time, we won’t be able to estimate Carol’s
finishing time better by finding out what Alice’s finishing time was. This means
we can model the relay race using only two interactions: Alice’s effect on Bob,
and Bob’s effect on Carol. We can omit the third, indirect interaction between
Alice and Carol from our model.
Structured probabilistic models provide a formal framework for modeling only
direct interactions between random variables. This allows the models to have
376