
invent the concepts it needs to model a particular dataset. The latent variables are
usually not very easy for a human to interpret after the fact, though visualization
techniques may allow some rough characterization of what they represent. When
latent variables are used in the context of traditional graphical models, they are
often designed with some specific semantics in mind—the topic of a document,
the intelligence of a student, the disease causing a patient’s symptoms, etc. These
models are often much more interpretable by human practitioners and often have
more theoretical guarantees, yet are less able to scale to complex problems and are
not reusable in as many different contexts as deep models.
Another obvious difference is the kind of connectivity typically used in the deep
learning approach. Deep graphical models typically have large groups of units that
are all connected to other groups of units, so that the interactions between two
groups may be described by a single matrix. Traditional graphical models have very
few connections and the choice of connections for each variable may be individually
designed. This is tightly linked with the choice of inference algorithm. Traditional
approaches to graphical models typically aim to maintain the tractability of exact
inference. When this constraint is too limiting, a popular approximate inference
algorithm is loopy belief propagation. Both of these approaches
often work well with very sparsely connected graphs. By comparison, models used
in deep learning tend to make use of the idea of distributed representations. This
means that each observed variable $v_i$ is connected to many latent variables
$h_j$ that provide a distributed representation of $v_i$ (and probably several other observed
variables too). Distributed representations have many advantages, but from the
point of view of graphical models and computational complexity, distributed
representations have the disadvantage of usually yielding graphs that are not
sparse enough for the traditional techniques of exact inference and loopy belief
propagation to be relevant. As a consequence, one of the most striking differences
between the larger graphical models community and the deep graphical models
community is that loopy belief propagation is almost never used for deep learning.
Most deep models are instead designed to make Gibbs sampling or variational
inference algorithms efficient. Another consideration is that deep learning models
contain a very large number of latent variables, making efficient numerical code
essential. This provides an additional motivation, besides the choice of high-level
inference algorithm, for grouping the units into layers with a matrix describing the
interaction between two layers. This allows the individual steps of the algorithm
to be implemented with efficient matrix product operations, or sparsely connected
generalizations, like block diagonal matrix products or convolutions.
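As a concrete illustration (not from the text), the sketch below shows block Gibbs
sampling in a binary restricted Boltzmann machine: the interaction between the visible
and hidden layers is a single dense weight matrix W, and each sampling step updates an
entire layer at once with one matrix product. All names and sizes here (W, b, c,
n_visible, n_hidden, the number of Gibbs steps) are illustrative, not prescribed by
the chapter.

    # Minimal sketch: block Gibbs sampling in a binary RBM, assuming
    # layer-to-layer interactions are captured by one weight matrix W.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, batch = 784, 500, 64

    # Single dense matrix describing visible-hidden interactions,
    # plus per-layer bias vectors.
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible biases
    c = np.zeros(n_hidden)    # hidden biases

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v):
        # Sample every hidden unit given the visible layer, then every
        # visible unit given the hidden layer. Each conditional for a whole
        # layer is computed with a single matrix product.
        p_h = sigmoid(v @ W + c)                          # P(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(v.dtype)
        p_v = sigmoid(h @ W.T + b)                        # P(v_i = 1 | h)
        v_new = (rng.random(p_v.shape) < p_v).astype(v.dtype)
        return v_new, h

    # Run a short chain from random visible states.
    v = (rng.random((batch, n_visible)) < 0.5).astype(np.float64)
    for _ in range(10):
        v, h = gibbs_step(v)

In a sparsely connected generalization, the dense products v @ W and h @ W.T would be
replaced by block diagonal matrix products or convolutions, but the layer-at-a-time
structure of the update stays the same.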
Finally, the deep learning approach to graphical modeling is characterized by
a marked tolerance of the unknown. Rather than simplifying the model until