Chapter 21
Deep Generative Models
In this chapter, we present several of the specific kinds of generative models that can
be built and trained using the techniques presented in chapters 14, 19 and 20. All
of these models represent probability distributions over multiple variables in some way.
Some allow the probability distribution function to be evaluated explicitly. Others do
not allow the evaluation of the probability distribution function, but support operations
that implicitly require knowledge of it, such as sampling. Some of these models are
structured probabilistic models described in terms of graphs and factors, as described
in chapter 14. Others can not easily be described in terms of factors, but represent
probability distributions nonetheless.
21.1 Restricted Boltzmann Machines
Restricted Boltzmann machines are some of the most common building blocks of deep
probabilistic models. They are undirected probabilistic graphical models containing a
layer of observable variables and a single layer of latent variables. RBMs may be stacked
(one on top of the other) to form deeper models. See Fig. 21.1 for some examples. In
particular, Fig. 21.1a shows the graph structure of an RBM itself. It is a bipartite graph,
with no connections permitted between any variables in the observed layer or between
any units in the latent layer.
TODO– review and pointers to other sections of the book This should be the main
place where they are described in detail, earlier they are just an example of undirected
models or an example of a feature learning algorithm.
More formally, we will consider the observed layer to consist of a set of D binary
random variables which we refer to collectively with the vector v, where the ith element
v_i is a binary random variable. We will refer to the latent or hidden layer of N
random variables collectively as h, with the jth element denoted h_j.
The restricted Boltzmann machine is an energy-based model, which means that the
joint probability distribution is fully-specified by its energy function:
P(v, h) = \frac{1}{Z} \exp\{ -E(v, h) \}.
Figure 21.1: Examples of models that may be built with restricted Boltzmann machines.
a) The restricted Boltzmann machine itself is an undirected graphical model based on a
bipartite graph. There are no connections among the visible units, nor any connections
among the hidden units. Typically every visible unit is connected to every hidden unit
but it is possible to construct sparsely connected RBMs such as convolutional RBMs. b)
A deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intra-layer connections. However, a DBN has
multiple hidden layers, and thus there are connections between hidden units that are in
separate layers. All of the local conditional probability distributions needed by the deep
belief network are copied directly from the local conditional probability distributions of
its constituent RBMs. Note that we could also represent the deep belief network with
a completely undirected graph, but it would need intra-layer connections to capture
the dependencies between parents. c) A deep Boltzmann machine is an undirected
graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs
lack intra-layer connections. DBMs are less closely tied to RBMs than DBNs are. When
initializing a DBM from a stack of RBMs, it is necessary to modify the RBM parameters
slightly. Some kinds of DBMs may be trained without first training a set of RBMs.
Here E(v, h) is the energy function that parametrizes the relationship between the
visible and hidden variables:

E(v, h) = -b^\top v - c^\top h - v^\top W h,    (21.1)
and Z is the normalizing constant known as the partition function:

Z = \sum_{v} \sum_{h} \exp\{ -E(v, h) \}.
For many undirected models, it is apparent from the definition of the partition func-
tion Z that the naive method of computing Z (exhaustively summing over all states)
would be computationally intractable. However, it is still possible that a more cleverly
designed algorithm could exploit regularities in the probability distribution to compute
Z faster than the naive algorithm, so the exponential cost of the naive algorithm is not
a guarantee of the partition function’s intractability. In the case of restricted Boltzmann
machines, there is actually a hardness result, proven by Long and Servedio (2010).
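To make the cost of the naive computation concrete, the following sketch (our own illustrative example, not part of the original text; the variable names and shapes are arbitrary choices) computes Z by exhaustive enumeration for a tiny RBM. The double loop visits all 2^D × 2^N joint states, which is only feasible for a handful of units.

import itertools
import numpy as np

def rbm_energy(v, h, b, c, W):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h

def partition_function(b, c, W):
    # Exhaustively sum exp(-E) over all joint binary states.
    # Cost is 2**D * 2**N energy evaluations, hence intractable for large models.
    D, N = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=D):
        for h in itertools.product([0, 1], repeat=N):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), b, c, W))
    return Z

# Tiny example: 4 visible and 3 hidden units (2**7 = 128 states).
rng = np.random.default_rng(0)
D, N = 4, 3
b, c, W = rng.normal(size=D), rng.normal(size=N), rng.normal(size=(D, N))
print(partition_function(b, c, W))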
21.1.1 Conditional Distributions
The intractable partition function Z implies that the joint probability distribution is
also intractable (in the sense that the normalized probability of a given joint configura-
tion of v, h is generally not available). However, due to the bipartite graph structure, the
restricted Boltzmann machine has the very special property that its conditional distri-
butions P (h | v) and P (v | h) are factorial and relatively simple to compute and sample
from. Indeed, it is this property that has made the RBM a relatively popular model
for a wide range of applications including image modeling (CITE), speech processing
(CITE) and natural language processing (CITE).
Deriving the conditional distributions from the joint distribution is straightforward.
p(h | v) = \frac{p(h, v)}{p(v)}

         = \frac{1}{p(v)} \frac{1}{Z} \exp\left\{ b^\top v + c^\top h + v^\top W h \right\}

         = \frac{1}{Z'} \exp\left\{ c^\top h + v^\top W h \right\}

         = \frac{1}{Z'} \exp\left\{ \sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} v^\top W_{:,j} h_j \right\}

         = \frac{1}{Z'} \prod_{j=1}^{n} \exp\left\{ c_j h_j + v^\top W_{:,j} h_j \right\}    (21.2)
Since we are conditioning on the visible units v, we can treat these as constants w.r.t.
the distribution p(h | v). The factorial nature of the conditional p(h | v) follows
immediately from our ability to write the joint probability over the vector h as the
product of (unnormalized) distributions over the individual elements h_j. It is now a
simple matter of normalizing the distributions over the individual binary h_j.
P(h_j = 1 | v) = \frac{\tilde{P}(h_j = 1 | v)}{\tilde{P}(h_j = 0 | v) + \tilde{P}(h_j = 1 | v)}

               = \frac{\exp\left\{ c_j + v^\top W_{:,j} \right\}}{\exp\{0\} + \exp\left\{ c_j + v^\top W_{:,j} \right\}}

               = \mathrm{sigmoid}\left( c_j + v^\top W_{:,j} \right).    (21.3)
We can now express the full conditional over the hidden layer as the factorial distribution:
P(h | v) = \prod_{j=1}^{n} \mathrm{sigmoid}\left( c_j + v^\top W_{:,j} \right).    (21.4)
A similar derivation will show that the other conditional of interest to us, P(v | h), is
also a factorial distribution:

P(v | h) = \prod_{i=1}^{d} \mathrm{sigmoid}\left( b_i + W_{i,:} h \right).    (21.5)
21.1.2 RBM Gibbs Sampling
The factorial nature of these conditionals is a very useful property of the RBM, and allows
us to efficiently draw samples from the joint distribution via a block Gibbs sampling
strategy (see section 15.1 for a more complete discussion of Gibbs sampling methods).
Block Gibbs sampling simply refers to the situation where, in each step of Gibbs
sampling, multiple variables (or a “block” of variables) are sampled jointly. In the case
of the RBM, each iteration of block Gibbs sampling consists of two steps. Step 1:
sample h^{(l)} ∼ P(h | v^{(l)}). Due to the factorial nature of the conditionals, we can
simultaneously and independently sample all of the elements of h^{(l)} given v^{(l)}. Step
2: sample v^{(l+1)} ∼ P(v | h^{(l)}). Again, the factorial nature of the conditional P(v | h^{(l)})
allows us to simultaneously and independently sample all of the elements of v^{(l+1)}
given h^{(l)}.
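As a concrete illustration, here is a minimal numpy sketch of ours for one block Gibbs step (the shapes are an assumption of this example: W is taken to be D × N, so that the products below match the conditionals in Eqs. 21.4 and 21.5).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, b, c, W, rng):
    # Step 1: sample h ~ P(h | v); all hidden units are sampled independently, in parallel.
    p_h = sigmoid(c + v @ W)              # Eq. 21.4
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Step 2: sample v' ~ P(v | h); all visible units are sampled independently, in parallel.
    p_v = sigmoid(b + W @ h)              # Eq. 21.5
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
D, N = 6, 4
b, c, W = np.zeros(D), np.zeros(N), 0.1 * rng.normal(size=(D, N))
v = rng.integers(0, 2, size=D).astype(float)
for _ in range(100):                      # run the chain for a while
    v, h = gibbs_step(v, b, c, W, rng)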
21.2 Training Restricted Boltzmann Machines
Despite the simplicity of the RBM conditionals, training these models is not without
its complications. As a probabilistic model, a sensible inductive principle for estimating
the model parameters is maximum likelihood, though other inductive principles are certainly
possible (Marlin et al., 2010) and will be discussed later in Sec. 21.2.3. In the following
we derive the maximum likelihood gradient with respect to the model parameters.
Let us consider that we have a batch (or minibatch) of n examples taken from an i.i.d.
dataset (independently and identically distributed examples) \{ v^{(1)}, \ldots, v^{(t)}, \ldots, v^{(n)} \}.
The log likelihood under the RBM with parameters b (visible unit biases), c (hidden
unit biases) and W (interaction weights) is given by:
\ell(W, b, c) = \sum_{t=1}^{n} \log P(v^{(t)})

             = \sum_{t=1}^{n} \log \sum_{h} P(v^{(t)}, h)

             = \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \log Z

             = \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \log \sum_{v,h} \exp\{ -E(v, h) \}    (21.6)
In the last line of the equation above, we have used the definition of the partition
function.
To maximize the likelihood of the data under the restricted Boltzmann machine, we
consider the gradient of the likelihood with respect to the model parameters, which we
will refer to collectively as θ = {b, c, W }:
\nabla_\theta \ell(\theta) = \nabla_\theta \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \nabla_\theta \log \sum_{v,h} \exp\{ -E(v, h) \}

 = -\sum_{t=1}^{n} \frac{\sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \nabla_\theta E(v^{(t)}, h)}{\sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\}} + n \frac{\sum_{v,h} \exp\{ -E(v, h) \} \nabla_\theta E(v, h)}{\sum_{v,h} \exp\{ -E(v, h) \}}

 = -\sum_{t=1}^{n} \mathbb{E}_{P(h | v^{(t)})} \left[ \nabla_\theta E(v^{(t)}, h) \right] + n \, \mathbb{E}_{P(v,h)} \left[ \nabla_\theta E(v, h) \right]    (21.7)
As we can see from Eq. 21.7, the gradient of the log likelihood is specified as the difference
between two expectations of the gradient of the energy function. The first expectation
(the data term) is with respect to the product of the empirical distribution over the data,
\hat{P}(v) = \frac{1}{n} \sum_{t=1}^{n} \delta(v - v^{(t)}),^1 and the conditional distribution P(h | v^{(t)}). The second
expectation (the model term) is with respect to the joint model distribution P(v, h).
This difference between a data-driven term and a model-driven term is not unique to
RBMs; as discussed in some detail in Sec. 19.2, it is a general feature of the maximum
likelihood gradient for all undirected models.
We can complete the derivation of the log-likelihood gradient by expanding the term
\nabla_\theta E(v, h). We will consider first the gradient of the negative energy function with
respect to W.

^1 As discussed in Sec. 3.10.4, we use the term empirical distribution to refer to a mixture over delta
functions placed on training examples.
\nabla_{W} \left( -E(v, h) \right) = \nabla_{W} \left( b^\top v + c^\top h + v^\top W h \right) = h v^\top    (21.8)
The gradients with respect to b and c are similarly derived:

\nabla_{b} \left( -E(v, h) \right) = v, \qquad \nabla_{c} \left( -E(v, h) \right) = h    (21.9)
Putting it all together, we obtain the following expressions for the gradients with respect
to the RBM parameters, given n training examples:

\nabla_{W} \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} v^{(t)\top} - n \, \mathbb{E}_{P(v,h)} \left[ h v^\top \right]

\nabla_{b} \ell(W, b, c) = \sum_{t=1}^{n} v^{(t)} - n \, \mathbb{E}_{P(v,h)} [v]

\nabla_{c} \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} - n \, \mathbb{E}_{P(v,h)} [h]
where we have defined \hat{h}^{(t)} as

\hat{h}^{(t)} = \mathbb{E}_{P(h | v^{(t)})} [h] = \mathrm{sigmoid}\left( c + v^{(t)\top} W \right).    (21.10)
While we are able to write down these expressions for the log-likelihood gradient,
unfortunately, in most situations of interest, we are not able to use them directly to
calculate gradients. The problem is the expectations over the joint model distribution
P (v, h). While we have conditional distributions P(v | h) and P (h | v) that are easy
to work with, the RBM joint distribution is not amenable to analytic evaluation of the
expectation \mathbb{E}_{P(v,h)} [f(v, h)].
This is bad news: it implies that in most cases it is impractical to compute the exact
log-likelihood gradient. Fortunately, as discussed in Sec. 19.2, there are two widely
used approximation strategies that have been applied to the training of RBM with some
degree of success: contrastive divergence and stochastic maximum likelihood.
In the following sections we discuss two different strategies to approximate this gra-
dient that have been applied to training the RBM. However, before getting into the
actual training algorithms, it is worth considering what general approaches are available
to us in approximating the log-likelihood gradient. As we mentioned, our problem stems
from the expectation over the joint distribution P (v, h), but we know that we have ac-
cess to factorial conditionals and that we can use these as the basis of a Gibbs sampling
procedure to recover samples from the joint distribution (as discussed in Sec. 21.1.2).
Thus, we can imagine using, for example, T MCMC samples from P (v, h) to form a
Monte Carlo estimate of the expectations over the joint distribution:
\mathbb{E}_{P(v,h)} [f(v, h)] \approx \frac{1}{T} \sum_{t=1}^{T} f(v^{(t)}, h^{(t)}).    (21.11)
There is a problem with this strategy that has to do with the initialization of the MCMC
chain. MCMC chains typically require a burn-in period, where the chain must be run for
some number of steps from its initial state before it begins to produce samples that are
close to its equilibrium distribution.
21.2.1 Contrastive Divergence Training of the RBM
As discussed in a more general context in Sec. 19.2, contrastive divergence (CD) seeks
to approximate the expectation over the joint distribution with samples drawn from
short Gibbs sampling chains. CD deals with the typical requirement for an extended
burn-in sample sequence by initializing these chains at the data points used in the
data-dependent, conditional term. The result is a biased approximation of the log-
likelihood gradient (Carreira-Perpiñán and Hinton, 2005; Bengio and Delalleau, 2009;
Fischer and Igel, 2011) that has nevertheless been empirically shown to be effective.
The contrastive divergence algorithm, as applied to RBMs, is given in Algorithm 21.1.
21.2.2 Stochastic Maximum Likelihood (Persistent Contrastive Diver-
gence) for the RBM
While contrastive divergence has been the most popular method of training RBMs,
the stochastic maximum likelihood (SML) algorithm (Younes, 1998; Tieleman, 2008) is
known to be a competitive alternative – especially if we are interested in recovering the
best possible generative model (i.e. achieving the highest possible test set likelihood).
As with CD, the general SML algorithm is described in Sec. 19.2. Here we are concerned
with how to apply the algorithm to training an RBM.
In comparison to the CD algorithm, SML uses an alternative solution to the prob-
lem of how to approximate the partition function’s contribution to the log-likelihood
gradient. Instead of initializing the k-step MCMC chain with the current example from
the training set, in SML we initialize the MCMC chain for training iteration s with the
last state of the MCMC chain from the previous training iteration (s − 1). Assuming that
the gradient updates to the model parameters do not significantly change the model,
the MCMC state of the last iteration should be close to the equilibrium distribution at
iteration s, minimizing the number of “burn-in” MCMC steps needed to reach equilib-
rium at the current iteration. As with CD, in practice we often use just one Gibbs step
between learning iterations. Algorithm 21.2 describes the SML algorithm as applied to
RBMs.
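To make the idea of persistent chains concrete, here is a minimal numpy sketch of SML/PCD updates (an illustrative example of ours, not the book's reference implementation; it assumes binary data in an array `data` of shape (n, D) and uses a single Gibbs step per update, i.e. k = 1).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, D, N, eps = 1000, 16, 8, 0.01
data = (rng.random((n, D)) < 0.5).astype(float)      # placeholder binary dataset
W, b, c = 0.01 * rng.normal(size=(D, N)), np.zeros(D), np.zeros(N)

m = 100                                               # minibatch size / number of fantasy particles
v_model = (rng.random((m, D)) < 0.5).astype(float)    # persistent chains, initialized randomly

for step in range(500):
    v_data = data[rng.integers(0, n, size=m)]
    h_data = sigmoid(v_data @ W + c)                  # positive phase uses E[h | v] (Eq. 21.10)

    # Negative phase: continue the persistent chains from their previous state
    # rather than restarting them at the data (the key difference from CD).
    h_model = (rng.random((m, N)) < sigmoid(v_model @ W + c)).astype(float)
    v_model = (rng.random((m, D)) < sigmoid(h_model @ W.T + b)).astype(float)
    h_model_mean = sigmoid(v_model @ W + c)

    W += eps * (v_data.T @ h_data - v_model.T @ h_model_mean) / m
    b += eps * (v_data - v_model).mean(axis=0)
    c += eps * (h_data - h_model_mean).mean(axis=0)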
TODO: include experimental examples, i.e. an RBM trained with CD on MNIST
21.2.3 Other Inductive Principles
TODO:Other inductive principles have been used to train RBMs. In this section we
briefly discuss these.
Algorithm 21.1 The contrastive divergence algorithm, using gradient ascent as the
optimization procedure.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v; θ)
  to mix when initialized from p_data. Perhaps 1 to 20 to train an RBM on a small image
  patch.
  while not converged do
    Sample a minibatch of m examples from the training set {v^{(1)}, . . . , v^{(m)}}.
    ∆W ← (1/m) Σ_{t=1}^{m} ĥ^{(t)} v^{(t)⊤}
    ∆b ← (1/m) Σ_{t=1}^{m} v^{(t)}
    ∆c ← (1/m) Σ_{t=1}^{m} ĥ^{(t)}
    for t = 1 to m do
      ṽ^{(t)} ← v^{(t)}
    end for
    for l = 1 to k do
      for t = 1 to m do
        h̃^{(t)} sampled from Π_{j=1}^{n} sigmoid(c_j + ṽ^{(t)⊤} W_{:,j}).
        ṽ^{(t)} sampled from Π_{i=1}^{d} sigmoid(b_i + W_{i,:} h̃^{(t)}).
      end for
    end for
    h̄^{(t)} ← sigmoid(c + ṽ^{(t)⊤} W), for t = 1 to m
    ∆W ← ∆W − (1/m) Σ_{t=1}^{m} h̄^{(t)} ṽ^{(t)⊤}
    ∆b ← ∆b − (1/m) Σ_{t=1}^{m} ṽ^{(t)}
    ∆c ← ∆c − (1/m) Σ_{t=1}^{m} h̄^{(t)}
    W ← W + ε ∆W
    b ← b + ε ∆b
    c ← c + ε ∆c
  end while
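A numpy rendering of one CD-k parameter update, following the structure of Algorithm 21.1, may also be helpful. This is our own illustrative sketch; the variable names and shapes are assumptions of the example (`v_batch` holds m binary examples of dimension D, and W has shape D × N).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_batch, W, b, c, k, eps, rng):
    """One contrastive divergence update of (W, b, c), in the spirit of Algorithm 21.1."""
    m = v_batch.shape[0]
    h_hat = sigmoid(v_batch @ W + c)                    # E[h | v] for the data (positive phase)

    v_tilde = v_batch.copy()                            # chains are initialized at the data
    for _ in range(k):                                  # k steps of block Gibbs sampling
        h_tilde = (rng.random(h_hat.shape) < sigmoid(v_tilde @ W + c)).astype(float)
        v_tilde = (rng.random(v_batch.shape) < sigmoid(h_tilde @ W.T + b)).astype(float)
    h_bar = sigmoid(v_tilde @ W + c)                    # E[h | v~] for the samples (negative phase)

    W += eps * (v_batch.T @ h_hat - v_tilde.T @ h_bar) / m
    b += eps * (v_batch - v_tilde).mean(axis=0)
    c += eps * (h_hat - h_bar).mean(axis=0)
    return W, b, c

# Example usage on random binary data.
rng = np.random.default_rng(0)
m, D, N = 64, 16, 8
W, b, c = 0.01 * rng.normal(size=(D, N)), np.zeros(D), np.zeros(N)
v_batch = (rng.random((m, D)) < 0.5).astype(float)
W, b, c = cd_k_update(v_batch, W, b, c, k=1, eps=0.05, rng=rng)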
21.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first successful non-convolutional architec-
tures. The introduction of deep belief networks in 2006 began the current deep learning
renaissance. Prior to the introduction of deep belief networks, deep models were con-
sidered too difficult to optimize, due to the vanishing and exploding gradient problems
and the existence of plateaus, negative curvature, and suboptimal local minima that can
arise in neural network objective functions. Kernel machines with convex objective func-
tions dominated the research landscape. Deep belief networks demonstrated that deep
architectures can be successful, by outperforming kernelized support vector machines
on the MNIST dataset (Hinton et al., 2006). Today, deep belief networks have mostly
fallen out of favor and are rarely used, even compared to other unsupervised or generative
learning algorithms, but they are still deservedly recognized for their important role in
deep learning history.
Algorithm 21.2 The stochastic maximum likelihood / persistent contrastive divergence
algorithm for training an RBM.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v, h; θ + ε∆θ)
  to burn in, starting from samples from p(v, h; θ). Perhaps 1 for an RBM on a small
  image patch.
  Initialize a set of m samples {ṽ^{(1)}, . . . , ṽ^{(m)}} to random values (e.g., from a uniform or
  normal distribution, or possibly a distribution with marginals matched to the model’s
  marginals).
  while not converged do
    Sample a minibatch of m examples {v^{(1)}, . . . , v^{(m)}} from the training set.
    ∆W ← (1/m) Σ_{t=1}^{m} ĥ^{(t)} v^{(t)⊤}
    ∆b ← (1/m) Σ_{t=1}^{m} v^{(t)}
    ∆c ← (1/m) Σ_{t=1}^{m} ĥ^{(t)}
    for l = 1 to k do
      for t = 1 to m do
        h̃^{(t)} sampled from Π_{j=1}^{n} sigmoid(c_j + ṽ^{(t)⊤} W_{:,j}).
        ṽ^{(t)} sampled from Π_{i=1}^{d} sigmoid(b_i + W_{i,:} h̃^{(t)}).
      end for
    end for
    h̄^{(t)} ← sigmoid(c + ṽ^{(t)⊤} W), for t = 1 to m
    ∆W ← ∆W − (1/m) Σ_{t=1}^{m} h̄^{(t)} ṽ^{(t)⊤}
    ∆b ← ∆b − (1/m) Σ_{t=1}^{m} ṽ^{(t)}
    ∆c ← ∆c − (1/m) Σ_{t=1}^{m} h̄^{(t)}
    W ← W + ε ∆W,  b ← b + ε ∆b,  c ← c + ε ∆c
  end while
Deep belief networks are generative models with several layers of latent variables.
The latent variables are typically binary, and the visible units may be binary or real.
There are no intra-layer connections. Usually, every unit in each layer is connected to
every unit in each neighboring layer, though it is possible to construct more sparsely
connected DBNs. The connections between the top two layers are undirected. The
connections between all other layers are directed, with the arrows pointed toward the
layer that is closest to the data. See Fig. 21.1b for an example.
A DBN with L hidden layers contains L weight matrices: W^{(1)}, \ldots, W^{(L)}. It also
contains L + 1 bias vectors: b^{(0)}, \ldots, b^{(L)}, with b^{(0)} providing the biases for the visible
layer. The probability distribution represented by the DBN is given by
p(h^{(L)}, h^{(L-1)}) \propto \exp\left( b^{(L)\top} h^{(L)} + b^{(L-1)\top} h^{(L-1)} + h^{(L-1)\top} W^{(L)} h^{(L)} \right),

p(h^{(l)}_i = 1 \mid h^{(l+1)}) = \sigma\left( b^{(l)}_i + W^{(l+1)\top}_{:,i} h^{(l+1)} \right) \; \forall i, \; \forall l \in 1, \ldots, L - 2,

p(v_i = 1 \mid h^{(1)}) = \sigma\left( b^{(0)}_i + W^{(1)\top}_{:,i} h^{(1)} \right) \; \forall i.
In the case of real-valued visible units, substitute

v \sim \mathcal{N}\left( v \mid b^{(0)} + W^{(1)\top} h^{(1)}, \, \beta^{-1} \right)

with β diagonal for tractability. Generalizations to other exponential family visible units
are straightforward, at least in theory. Note that a DBN with only one hidden layer is
just an RBM.
To generate a sample from a DBN, we first run several steps of Gibbs sampling on
the top two hidden layers. This stage is essentially drawing a sample from the RBM
defined by the top two hidden layers. We can then use a single pass of ancestral sampling
through the rest of the model to draw a sample from the visible units.
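A sketch of this two-stage sampling procedure (our own illustration; the list layout of W and b, and the convention that W[l] connects layer l to layer l + 1 with b[0] the visible biases, are assumptions of this example):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(W, b, n_gibbs, rng):
    """W[l] connects layer l to layer l+1 (W[0]: visible-to-h1); b[l] are the biases of layer l."""
    L = len(W)                                     # number of hidden layers
    # Stage 1: block Gibbs sampling in the top RBM over (h^(L-1), h^(L)).
    h_top = sample_bernoulli(0.5 * np.ones(W[-1].shape[1]), rng)
    for _ in range(n_gibbs):
        h_below = sample_bernoulli(sigmoid(b[L - 1] + W[-1] @ h_top), rng)
        h_top = sample_bernoulli(sigmoid(b[L] + W[-1].T @ h_below), rng)
    # Stage 2: a single pass of ancestral (top-down) sampling through the directed layers.
    h = h_below
    for l in range(L - 2, -1, -1):                 # layers L-2, ..., 1, then the visible layer
        h = sample_bernoulli(sigmoid(b[l] + W[l] @ h), rng)
    return h                                       # a sample of the visible units

# Example: a DBN with layer sizes 6 (visible), 5, 4.
rng = np.random.default_rng(0)
sizes = [6, 5, 4]
W = [0.1 * rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
b = [np.zeros(s) for s in sizes]
v = sample_dbn(W, b, n_gibbs=100, rng=rng)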
Inference in a deep belief network is intractable due to the explaining away effect
within each directed layer, and due to the interaction between the two final hidden layers.
Evaluating or maximizing the standard evidence lower bound on the log likelihood is also
intractable, because the evidence lower bound takes the expectation of cliques whose
size is equal to the network width.
Evaluating or maximizing the log likelihood requires not just confronting the problem
of intractable inference to marginalize out the latent variables, but also the problem of
an intractable partition function within the undirected model of the last two layers.
As a hybrid of directed and undirected models, deep belief networks encounter many
of the difficulties associated with both families of models. Because deep belief networks
are partially undirected, they require Markov chains for sampling and have an intractable
partition function. Because they are directed and generally consist of binary random
variables, their evidence lower bound is intractable.
TODO–training procedure TODO–discriminative fine-tuning TODO–view of MLP
as variational inference with very loose bound comment on how this does not capture
intra-layer explaining away interactions comment on how this does not capture inter-
layer feedback interactions TODO–quantitative analysis with AIS TODO–wake sleep?
The term “deep belief network” is commonly used incorrectly to refer to any kind of
deep neural network, even networks without latent variable semantics. The term “deep
belief network” should refer specifically to models with undirected connections in the
deepest layer and directed connections pointing downward between all other pairs of
sequential layers.
The term “deep belief network” may also cause some confusion because the term
“belief network” is sometimes used to refer to purely directed models, while deep belief
networks contain an undirected layer. Deep belief networks also share the acronym DBN
with dynamic Bayesian networks, which are Bayesian networks for representing Markov
chains.
21.4 Deep Boltzmann Machines
A deep Boltzmann machine (DBM) is another kind of deep, generative model (Salakhut-
dinov and Hinton, 2009a). Unlike the deep belief network (DBN), it is an entirely undi-
rected model. Unlike the RBM, the DBM has several layers of latent variables (RBMs
have just one). But like the RBM, within each layer, each of the variables are mutually
independent, conditioned on the variables in the neighboring layers. See Fig. 21.2 for
the graph structure.
Like RBMs and DBNs, DBMs typically contain only binary units – as we assume in
our development of the model – but they may sometimes contain real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution
over the model variables is parametrized by an energy function E. In the case of a deep
Boltzmann machine with one visible layer, v, and three hidden layers, h^{(1)}, h^{(2)} and h^{(3)},
the joint probability is given by:
P\left( v, h^{(1)}, h^{(2)}, h^{(3)} \right) = \frac{1}{Z(\theta)} \exp\left( -E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) \right).    (21.12)
The DBM energy function is:
E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} - h^{(2)\top} W^{(3)} h^{(3)}.    (21.13)
In comparison to the RBM energy function (Eq. 21.1), the DBM energy function in-
cludes connections between the hidden units (latent variables) in the form of the weight
matrices (W^{(2)} and W^{(3)}). As we will see, these connections have significant conse-
quences for both the model behavior as well as how we go about performing inference
in the model.
In comparison to fully connected Boltzmann machines (with every unit connected
to every other unit), the DBM offers advantages similar to those offered by the RBM.
Specifically, as illustrated in Fig. (TODO: include figure), the DBM layers can be
organized into a bipartite graph, with odd layers on one side and even layers on the
other. This immediately implies that when we condition on the variables in the even
layers, the variables in the odd layers become conditionally independent. Of course,
when we condition on the variables in the odd layers, the variables in the even layers
also become conditionally independent.
We show this explicitly for the conditional distribution P(h^{(1)} \mid v, h^{(2)}), in the
case of a DBM with two hidden layers (of course, this result generalizes to a DBM with
Figure 21.2: The deep Boltzmann machine (biases on all units are present but suppressed
to simplify notation).
any number of layers).
P(h^{(1)} \mid v, h^{(2)}) = \frac{P(h^{(1)}, v, h^{(2)})}{P(v, h^{(2)})}

= \frac{\exp\left( v^\top W^{(1)} h^{(1)} + h^{(1)\top} W^{(2)} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \exp\left( v^\top W^{(1)} h^{(1)} + h^{(1)\top} W^{(2)} h^{(2)} \right)}

= \frac{\exp\left( \sum_{j=1}^{n} v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \exp\left( \sum_{j'=1}^{n} v^\top W^{(1)}_{:,j'} h^{(1)}_{j'} + h^{(1)}_{j'} W^{(2)}_{j',:} h^{(2)} \right)}

= \frac{\prod_{j} \exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \prod_{j'} \exp\left( v^\top W^{(1)}_{:,j'} h^{(1)}_{j'} + h^{(1)}_{j'} W^{(2)}_{j',:} h^{(2)} \right)}

= \prod_{j} \frac{\exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_j = 0}^{1} \exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}

= \prod_{j} \frac{\exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{1 + \exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}

= \prod_{j} P(h^{(1)}_j \mid v, h^{(2)}).    (21.14)
From the above we can conclude that the conditional distribution for any layer of the
DBM, conditioned on the neighboring layers, is factorial (i.e. all variables in the layer
are conditionally independent). Further, we’ve shown that this conditional distribution
is given by a logistic sigmoid function:
P(h^{(1)}_j = 1 \mid v, h^{(2)}) = \frac{\exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}{1 + \exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}

= \frac{1}{1 + \exp\left( -v^\top W^{(1)}_{:,j} - W^{(2)}_{j,:} h^{(2)} \right)}

= \mathrm{sigmoid}\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right).    (21.15)
For the two layer DBM, the conditional distributions of the remaining two layers
(v, h^{(2)}) also factorize. That is, P(v \mid h^{(1)}) = \prod_{i=1}^{d} P(v_i \mid h^{(1)}), where

P(v_i = 1 \mid h^{(1)}) = \mathrm{sigmoid}\left( W^{(1)}_{i,:} h^{(1)} \right).    (21.16)
Also, P(h^{(2)} \mid h^{(1)}) = \prod_{k=1}^{m} P(h^{(2)}_k \mid h^{(1)}), where

P(h^{(2)}_k = 1 \mid h^{(1)}) = \mathrm{sigmoid}\left( h^{(1)\top} W^{(2)}_{:,k} \right).    (21.17)
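Because the conditionals in Eqs. 21.15–21.17 are factorial, one full Gibbs sweep of the two hidden layer DBM can be organized as two block updates: first sample h^{(1)} given its neighbors (v, h^{(2)}), then sample v and h^{(2)} in parallel given h^{(1)}. A minimal sketch of ours (biases omitted, matching the energy in Eq. 21.13; the layer sizes are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bern(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    # Odd layer: h^(1) depends on both of its neighbors (Eq. 21.15).
    h1 = bern(sigmoid(v @ W1 + W2 @ h2), rng)
    # Even layers: v (Eq. 21.16) and h^(2) (Eq. 21.17) are conditionally
    # independent given h^(1), so they can be resampled in parallel.
    v = bern(sigmoid(W1 @ h1), rng)
    h2 = bern(sigmoid(h1 @ W2), rng)
    return v, h1, h2

rng = np.random.default_rng(0)
d, n, m = 6, 5, 4                      # sizes of v, h^(1), h^(2)
W1, W2 = 0.1 * rng.normal(size=(d, n)), 0.1 * rng.normal(size=(n, m))
v, h1, h2 = bern(0.5 * np.ones(d), rng), np.zeros(n), np.zeros(m)
for _ in range(100):
    v, h1, h2 = dbm_gibbs_sweep(v, h1, h2, W1, W2, rng)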
21.4.1 Interesting Properties
TODO: comparison to DBNs TODO: comparison to neuroscience (local learning) “most
biologically plausible” TODO: description of easy mean field TODO: description of
sampling, comparison to general Boltzmann machines,DBNs
21.4.2 Mean Field Inference in the DBM
For the two hidden layer DBM, the conditional distributions P(v \mid h^{(1)}), P(h^{(1)} \mid
v, h^{(2)}), and P(h^{(2)} \mid h^{(1)}) are factorial. However, the posterior distribution over all the
hidden units given the visible units, i.e. P(h^{(1)}, h^{(2)} \mid v), can be complicated. This is, of
course, due to the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these
variables mutually dependent, given an observed v.
So, like the DBN, we are left to seek out methods to approximate the DBM poste-
rior distribution. However, unlike the DBN, the DBM posterior distribution over its
hidden units – while complicated – is easy to approximate with a variational approxi-
mation (as discussed in Sec. 20.1), specifically a mean field approximation. The mean
field approximation is a simple form of variational inference, where we restrict the ap-
proximating distribution to fully factorial distributions. In the context of DBMs, the
mean field equations capture the bidirectional interactions between layers. In this sec-
tion we derive the iterative approximate inference procedure originally introduced in
Salakhutdinov and Hinton (2009a).
In variational approximations to inference, we approach the task of approximating
a particular target distribution – in our case, the posterior distribution over the hidden
units given the visible units – by some reasonably simple family of distributions. In the
case of the mean field approximation, the approximating family is the set of distributions
where the hidden units are conditionally independent.
Let Q(h^{(1)}, h^{(2)} \mid v) be the approximation of P(h^{(1)}, h^{(2)} \mid v). The mean field
assumption implies that

Q(h^{(1)}, h^{(2)} \mid v) = \prod_{j=1}^{n} Q(h^{(1)}_j \mid v) \prod_{k=1}^{m} Q(h^{(2)}_k \mid v).    (21.18)
The mean field approximation attempts to find, for every observation, a member of
this family of distributions that “best fits” the true posterior P(h^{(1)}, h^{(2)} \mid v). By best
fit, we specifically mean that we wish to find the approximation Q that minimizes the
KL-divergence with P, i.e. KL(Q\|P), where:

KL(Q\|P) = \sum_{h} Q(h^{(1)}, h^{(2)} \mid v) \log\left( \frac{Q(h^{(1)}, h^{(2)} \mid v)}{P(h^{(1)}, h^{(2)} \mid v)} \right)    (21.19)
In general, we do not have to provide a parametric form of the approximating distri-
bution beyond enforcing the independence assumptions. The variational approximation
procedure is generally able to recover a functional form of the approximate distribution.
However, in the case of a mean field assumption on binary hidden units (the case we are
considering here) there is no loss of generality by fixing a parametrization of the model
in advance.
We parametrize Q as a product of Bernoulli distributions; that is, we associate the
probability of each element of h^{(1)} with a parameter. Specifically, for
each j \in \{1, \ldots, n\}, \hat{h}^{(1)}_j = Q(h^{(1)}_j = 1), where \hat{h}^{(1)}_j \in [0, 1], and for each k \in \{1, \ldots, m\},
\hat{h}^{(2)}_k = Q(h^{(2)}_k = 1), where \hat{h}^{(2)}_k \in [0, 1]. Thus we have the following approximation to
the posterior:
Q(h^{(1)}, h^{(2)} \mid v) = \prod_{j=1}^{n} Q(h^{(1)}_j \mid v) \prod_{k=1}^{m} Q(h^{(2)}_k \mid v)

= \prod_{j=1}^{n} \left( \hat{h}^{(1)}_j \right)^{h^{(1)}_j} \left( 1 - \hat{h}^{(1)}_j \right)^{\left( 1 - h^{(1)}_j \right)} \times \prod_{k=1}^{m} \left( \hat{h}^{(2)}_k \right)^{h^{(2)}_k} \left( 1 - \hat{h}^{(2)}_k \right)^{\left( 1 - h^{(2)}_k \right)}    (21.20)
Of course, for DBMs with more layers the approximate posterior parametrization can
be extended in the obvious way.
Now that we have specified our family of approximating distributions Q, it remains
to specify a procedure for choosing the member of this family that best fits P. One way to
do this is to explicitly minimize KL(Q\|P) with respect to the variational parameters of
Q. We will approach the selection of Q from a slightly different, but entirely equivalent,
path. Rather than minimize KL(Q\|P), we will maximize the variational lower bound
(or evidence lower bound; see Sec. 20.1), which in the context of the 2-hidden-layer
deep Boltzmann machine is given by:
\mathcal{L}(Q) = \sum_{h^{(1)}, h^{(2)}} Q(h^{(1)}, h^{(2)} \mid v) \log\left( \frac{P(v, h^{(1)}, h^{(2)}; \theta)}{Q(h^{(1)}, h^{(2)} \mid v)} \right)

= -\sum_{h^{(1)}, h^{(2)}} Q(h^{(1)}, h^{(2)} \mid v) \, E(v, h^{(1)}, h^{(2)}; \theta) - \log Z(\theta) + \mathcal{H}(Q)
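Maximizing L(Q) with respect to each Bernoulli parameter while holding the others fixed yields the familiar mean field fixed-point updates (Salakhutdinov and Hinton, 2009a): each update is the expected version of the corresponding Gibbs conditional, with sampled neighbors replaced by their variational expectations. A small numpy sketch of ours for the two hidden layer case (biases omitted, matching Eq. 21.13; the fixed iteration count is an assumption of the example):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iter=25):
    """Return the variational parameters (h1_hat, h2_hat) for a single observation v."""
    n, m = W1.shape[1], W2.shape[1]
    h1_hat, h2_hat = 0.5 * np.ones(n), 0.5 * np.ones(m)
    for _ in range(n_iter):
        # Cf. Eq. 21.15 with h^(2) replaced by its expectation.
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)
        # Cf. Eq. 21.17 with h^(1) replaced by its expectation.
        h2_hat = sigmoid(h1_hat @ W2)
    return h1_hat, h2_hat

rng = np.random.default_rng(0)
d, n, m = 6, 5, 4
W1, W2 = 0.1 * rng.normal(size=(d, n)), 0.1 * rng.normal(size=(n, m))
v = (rng.random(d) < 0.5).astype(float)
h1_hat, h2_hat = mean_field(v, W1, W2)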
21.4.3 Variational Expectation Maximization
TODO Variational EM for DBM
21.4.4 Variational Learning With SML
Because a deep Boltzmann machine contains restricted Boltzmann machines as compo-
nents, the hardness results for computing the partition function and sampling that apply
to restricted Boltzmann machines also apply to deep Boltzmann machines. This means
that evaluating the probability mass function of a Boltzmann machine requires approx-
imate methods such as annealed importance sampling. Likewise, training the model
requires approximations to the gradient of the log partition function. See chapter 19 for
a description of these methods.
The posterior distribution over the hidden units in a deep Boltzmann machine is
intractable, due to the interactions between different hidden layers. This means that
we must use approximate inference during learning. The standard approach is to use
stochastic gradient ascent on the mean field lower bound, as described in chapter 20.
Mean field is incompatible with most of the methods for approximating the gradients
of the log partition function described in chapter 19. Moreover, contrastive divergence
offers no speedup relative to naive MCMC methods when the posterior is intractable, be-
cause sampling the negative phase fantasy particles requires sampling from the posterior.
This means that DBMs are usually trained using stochastic maximum likelihood. The
negative phase samples can be generated simply by running a Gibbs sampling chain that
alternates between sampling the odd-numbered layers and sampling the even-numbered
layers.
Unfortunately, training a DBM using variational learning and SML from a random
initialization usually results in failure. In some cases, the model fails to learn to represent
the distribution adequately. In other cases, the DBM may represent the distribution
well, but with no higher likelihood than could be obtained with just an RBM. Note that
a DBM with very small weights in all but the first layer represents approximately the
same distribution as an RBM.
It is not clear exactly why this happens. When DBMs are initialized from a pre-
trained configuration, training usually succeeds. See section 21.4.5 for details. One
possibility is that it is difficult to coordinate the learning rate of the stochastic gradi-
ent algorithm with the number of Gibbs steps used in the negative phase of stochastic
maximum likelihood. SML relies on the learning rate being small enough relative to
the number of Gibbs steps that the Gibbs chain can mix again after each update to the
model parameters. The distribution represented by the model can change very rapidly
during the earlier parts of training, and this may make it difficult for the negative
chains employed by SML to fully mix. As described in section 21.4.6, multi-prediction
deep Boltzmann machines avoid the potential inaccuracy of SML by training with a dif-
ferent objective function that is less principled but easier to compute. Another possible
explanation for the failure of joint training with mean field and SML is that the Hessian
matrix could be poorly conditioned. This perspective motivates centered deep Boltz-
mann machines, presented in section 21.4.7, which modify the model family in order to
obtain a better conditioned Hessian matrix.
21.4.5 Layerwise Pretraining
The original and most popular method for overcoming the joint training problem of
DBMs is greedy layerwise pretraining. In this method, each layer of the DBM is trained
in isolation as an RBM. The first layer is trained to model the input data. Each subse-
quent RBM is trained to model samples from the previous RBM’s posterior distribution.
After all of the RBMs have been trained in this way, they can be combined to form a
DBM. The DBM may then be trained with PCD. Typically PCD training will only make
Figure 21.3: The deep Boltzmann machine training procedure used to obtain the state of
the art classification accuracy on the MNIST dataset (Srivastava et al., 2014; Salakhut-
dinov and Hinton, 2009a). a) Train an RBM by using CD to approximately maximize
log P(v). b) Train a second RBM that models h^{(1)} and Y by using CD-k to approx-
imately maximize log P(h^{(1)}, Y) where h^{(1)} is drawn from the first RBM’s posterior
conditioned on the data. Increase k from 1 to 20 during learning. c) Combine the two
RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic
maximum likelihood with k = 5. d) Delete Y from the model. Define a new set of
features h^{(1)} and h^{(2)} that are obtained by running mean field inference in the model
lacking Y. Use these features as input to an MLP whose structure is the same as an
additional pass of mean field, with an additional output layer for the estimate of Y.
Initialize the MLP’s weights to be the same as the DBM’s weights. Train the MLP to
approximately maximize log P(Y | v) using stochastic gradient descent and dropout.
Figure reprinted from (Goodfellow et al., 2013b).
a small change in the model’s parameters and its performance as measured by the log
likelihood it assigns to the data, or its ability to classify inputs.
Note that this greedy layerwise training procedure is not just coordinate ascent. It
bears some passing resemblance to coordinate ascent because we optimize one subset
of the parameters at each step. However, in the case of the greedy layerwise training
procedure, we actually use a different objective function at each step.
TODO: details of combining stacked RBMs into a DBM TODO: partial mean field
negative phase
21.4.6 Multi-Prediction Deep Boltzmann Machines
TODO– cite stoyanov TODO
21.4.7 Centered Deep Boltzmann Machines
TODO
This chapter has described the tools needed to fit a very broad class of probabilistic
models. Which tool to use depends on which aspects of the log-likelihood are problem-
atic.
Figure 21.4: TODO caption and label, reference from text Figure reprinted from (Good-
fellow et al., 2013b).
For the simplest distributions p, the log likelihood is tractable, and the model can be
fit with a straightforward application of maximum likelihood estimation and gradient
ascent as described in chapter
In this chapter, we have shown what to do in two different difficult cases. If Z is in-
tractable, then one may still use maximum likelihood estimation via the sampling ap-
proximation techniques described in section 19.2. If p(h | v) is intractable, one may still
train the model using the negative variational free energy rather than the likelihood, as
described in 20.4.
It is also possible that both of these difficulties will arise. An example of this occurs
with the deep Boltzmann machine (Salakhutdinov and Hinton, 2009b), which is essentially
a sequence of RBMs composed together. The model is depicted graphically in Fig. 21.1c.
This model still has the same problem with computing the partition function as the
simpler RBM does. It has also discarded the restricted structure that made P (h | v)
easy to represent in the RBM. The typical way to train the DBM is to minimize the
variational free energy rather than maximize the likelihood. Of course, the variational
free energy still depends on the partition function, so it is necessary to use sampling
techniques to approximate its gradient.
TODO: k-NADE
21.5 Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data, many
applications such as image and audio modeling seem to require the ability to represent
probability distributions over real values. In some cases, it is possible to treat real-
valued data in the interval [0, 1] as representing the expectation of a binary variable
(TODO cite some examples). However, this is not a particularly theoretically satisfying
approach.
21.5.1 Gaussian-Bernoulli RBMs
TODO– cite exponential family harmoniums? TODO– multiple ways of parametrizing
them (citations?)
21.5.2 mcRBMs
TODO–mcRBMs
2
TODO–HMC
21.5.3 mPoT Model
TODO–mPoT
2
The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the ”mc” is not
pronounced like the “Mc” in “McDonald’s.”
21.5.4 Spike and Slab Restricted Boltzmann Machines
Spike and slab restricted Boltzmann machines (Courville et al., 2011) or ssRBMs provide
another means of modeling the covariance structure of real-valued data. Compared to
mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamil-
tonian Monte Carlo methods.
The spike and slab RBM has two sets of hidden units: the spike units h, which
are binary, and the slab units s, which are real-valued. The mean of the visible units
conditioned on the hidden units is given by (h ⊙ s)W^\top. In other words, each column
W_{:,i} defines a component that can appear in the input. The corresponding spike
variable h_i determines whether that component is present at all. The corresponding
slab variable s_i determines the brightness of that component, if it is present. When
a spike variable is active, the corresponding slab variable adds variance to the input
along the axis defined by W_{:,i}. This allows us to model the covariance of the inputs.
Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs
sampling are still applicable. There is no need to invert any matrix.
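As a small illustration of how the spike and slab variables interact, here is a sketch of ours (the shapes and names are made up for the example): h gates each column of W on or off, and s scales the active columns, producing the conditional mean of the visible units.

import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                                    # visible dimension, number of (spike, slab) pairs
W = rng.normal(size=(d, n))                    # each column W[:, i] is a component/template
h = rng.integers(0, 2, size=n).astype(float)   # binary spikes: is component i present?
s = rng.normal(size=n)                         # real-valued slabs: how strongly does it appear?

# Mean of the visible units given the hidden units: (h * s) selects and scales
# the active columns of W, so only components with h_i = 1 contribute.
v_mean = W @ (h * s)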
Gating by the spike variables means that the true marginal distribution over h ⊙ s
is sparse. This is different from sparse coding, where samples from the model “almost
never” (in the measure theoretic sense) contain zeros in the code, and MAP inference is
required to impose sparsity.
The primary disadvantage of the spike and slab restricted Boltzmann machine is
that some settings of the parameters can correspond to a covariance matrix that is not
positive definite. Such a covariance matrix places more unnormalized probability on
values that are farther from the mean, causing the integral over all possible outcomes to
diverge. Generally this issue can be avoided with simple heuristic tricks. There is not yet
any theoretically satisfying solution. Using constrained optimization to explicitly avoid
the regions where the probability is undefined is difficult to do without being overly
conservative and also preventing the model from accessing high-performing regions of
parameter space.
Qualitatively, convolutional variants of the ssRBM produce excellent samples of nat-
ural images. Some examples are shown in Fig. 14.1.
The ssRBM allows for several extensions. Including higher-order interactions and
average-pooling of the slab variables (Courville et al., 2014) enables the model to learn
excellent features for a classifier when labeled data is scarce. Adding a term to the
energy function that prevents the partition function from becoming undefined results
in a sparse coding model, spike and slab sparse coding (Goodfellow et al., 2013c), also
known as S3C.
21.6 Convolutional Boltzmann Machines
As seen in chapter 9, extremely high dimensional inputs such as images place great
strain on the computation, memory, and statistical requirements of machine learning
models. Replacing matrix multiplication by discrete convolution with a small kernel is
the standard way of solving these problems for inputs that have translation invariant
spatial or temporal structure. Desjardins and Bengio (2008) showed that this approach
works well when applied to RBMs.
Deep convolutional networks usually require a pooling operation so that the spatial
size of each successive layer decreases. Feedforward convolutional networks often use
a pooling function such as the maximum of the elements to be pooled. It is unclear
how to generalize this to the setting of energy-based models. We could introduce a
binary pooling unit p over n binary detector units d and enforce p = max_i d_i by setting
the energy function to be ∞ whenever that constraint is violated. This does not scale
well though, as it requires evaluating 2^n different energy configurations to compute the
normalization constant. For a small 3 × 3 pooling region this requires 2^9 = 512 energy
function evaluations per pooling unit!
Lee et al. (2009) developed a solution to this problem called probabilistic max pool-
ing (not to be confused with “stochastic pooling,” which is a technique for implicitly
constructing ensembles of convolutional feedforward networks). The strategy behind
probabilistic max pooling is to constrain the detector units so at most one may be ac-
tive at a time. This means there are only n + 1 total states (one state for each of the
n detector units being on, and an additional state corresponding to all of the detector
units being off). The pooling unit is on if and only if one of the detector units is on.
The state with all units off is assigned energy zero. We can think of this as describing
a model with a single variable that has n + 1 states, or equivalently as a model that has
n + 1 variables and assigns energy ∞ to all but n + 1 joint assignments of those variables.
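Concretely, if z_i denotes the bottom-up input to detector unit d_i (an assumption of this sketch; we take the energy contribution of an active detector to be −z_i), the n + 1 allowed states can be normalized with a single softmax over {0, z_1, ..., z_n}, the zero corresponding to the all-off state. A short illustrative sketch of ours:

import numpy as np

def probabilistic_max_pool(z, rng):
    """z: inputs to the n detector units in one pooling region.
    Returns sampled detector states d (at most one active) and the pooling unit p."""
    logits = np.concatenate(([0.0], z))            # state 0 = all detector units off
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over the n + 1 allowed states
    state = rng.choice(len(probs), p=probs)
    d = np.zeros(len(z))
    if state > 0:
        d[state - 1] = 1.0                         # detector unit `state - 1` is the active one
    p = float(state > 0)                           # pooling unit is on iff some detector is on
    return d, p

rng = np.random.default_rng(0)
d, p = probabilistic_max_pool(np.array([0.5, 2.0, -1.0, 0.3]), rng)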
While efficient, probabilistic max pooling does force the detector units to be mutually
exclusive, which may be a useful regularizing constraint in some contexts or a harmful
limit on model capacity in other contexts. It also does not support overlapping pooling
regions. Overlapping pooling regions are usually required to obtain the best performance
from feedforward convolutional networks, so this constraint probably greatly reduces
the performance of convolutional Boltzmann machines.
Lee et al. (2009) demonstrated that probabilistic max pooling could be used to build
convolutional deep Boltzmann machines.^3 This model is able to perform operations such
as filling in missing portions of its input. However, it has not proven especially useful as
a pretraining strategy for supervised learning, performing similarly to shallow baseline
models introduced by Pinto et al. (2008).
TODO: comment on partition function changing when you change the image size,
boundary issues
21.7 Other Boltzmann Machines
TODO–Conditional Boltzmann machine TODO-RNN-RBM TODO–discriminative Boltz-
mann machine TODO–Heng’s class relevant and irrelevant Boltzmann machines TODO–
Honglak’s recent work
3
The publication describes the model as a ”deep belief network” but because it can be described
as a purely undirected model with tractable layer-wise mean field fixed point updates, it best fits the
definition of a deep Boltzmann machine.
21.8 Directed Generative Nets
TODO: sigmoid belief nets TODO: refer back to DBN TODO sparse coding TODO
deconvolutional nets? TODO refer back to S3C and BSC TODO: NADE will be in
RNN chapter, refer back ot it here make sure k-NADE and multi-NADE are mentioned
somewhere TODO: refer to DARN and NVIL? TODO: Stochastic Feedforward nets
21.8.1 Sigmoid Belief Nets
21.8.2 Variational Autoencoders
The variational autoencoder is model
L (21.21)
TODO
21.8.3 Variational Interpretation of PSD
TODO, develop the explanation of Sec. 9.1 of Bengio et al. (2013c).
21.8.4 Generative Adversarial Networks
Generative adversarial networks (TODO cite) are another kind of generative model
based on differentiable mappings from input noise to samples that resemble the data.
In this sense, they closely resemble variational autoencoders. However, the training
procedure is different, and the generative model is not necessarily coupled with an inference
network. It is theoretically possible to train an inference network using a strategy
similar to the wake-sleep algorithm, but there is no need to infer posterior variables
during training.
Generative adversarial networks are based on game theory. A generator network is
trained to map input noise z to samples x. This function g(z) defines the generative
model. The distribution p(z) is not learned; it is simply fixed to some distribution at the
start of training (usually a very unstructured distribution such as a normal or uniform
distribution). We can think of g(z) as defining a conditional distribution

p(x \mid z) = \mathcal{N}\left( x \mid g(z), \frac{1}{\beta} I \right),

but in all learning rules we take the limit as β → ∞, so we can treat g(z) itself as a sample
and ignore the parametrization of the output distribution.
The generator g is pitted against an adversary: a discriminator network d. The
discriminator network receives data or samples x as input and outputs its estimate
of the probability that x was sampled from the data rather than the model. During
training, d tries to maximize and g tries to minimize a value function measuring the log
probability of d being correct:
g
= arg min
g
max
d
V (g, d)
403
where
v(g, d) = E
xp
data
log d(x) + E
x
P
p
model
log (1 d(x)) .
The optimization of g can be done simply by backpropagating through d then g,
so the learning process requires neither approximate inference nor approximation of a
partition function gradient. In the case where \max_d v(g, d) is convex (such as the case
where optimization is performed directly in the space of probability density functions)
then the procedure is guaranteed to converge and is asymptotically consistent. In prac-
tice, the procedure can be difficult to make work, because it can be difficult to keep d
optimized well enough to provide a good estimate of how to update g at all times.
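A compact sketch of this alternating game on a toy one-dimensional problem (our own illustration using PyTorch; the network sizes, learning rates, and data distribution are arbitrary choices, not from the original text):

import torch
import torch.nn as nn

torch.manual_seed(0)
d = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
g = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))                # generator
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

for step in range(2000):
    x_data = 3.0 + 0.5 * torch.randn(64, 1)     # "data": samples from N(3, 0.5^2)
    z = torch.randn(64, 1)                       # fixed, unstructured noise distribution p(z)
    x_model = g(z)

    # Discriminator ascends v(g, d) = E_data[log d(x)] + E_model[log(1 - d(x))].
    loss_d = -(torch.log(d(x_data)).mean() + torch.log(1 - d(x_model.detach())).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator descends the same value function (backpropagating through d, then g).
    loss_g = torch.log(1 - d(g(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()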
21.9 A Generative View of Autoencoders
Many kinds of autoencoders can be viewed as probabilistic models. Different autoen-
coders can be interpreted as probabilistic models in different ways.
One of the first probabilistic interpretations of autoencoders was the view of denois-
ing autoencoders as energy-based models trained using regularized score matching. See
Sections 16.8.1 and 19.5 for details. Since the early work (Vincent, 2011a) made the
connection with Gaussian RBMs, this gave denoising auto-encoders with a particu-
lar parametrization a generative interpretation (they could be sampled from using the
MCMC sampling techniques for Gaussian RBMs).
The next milestone in connecting auto-encoders with a generative interpretation
came with the work of Rifai et al. (2012). It relied on the view of contractive auto-
encoders as estimators of the tangent of the manifold near which probability concen-
trates, discussed in Section 16.9 (see also Figures 16.9, 18.3). In this context, Rifai
et al. (2012) demonstrated experimentally that good samples could be obtained from a
trained contractive auto-encoder by alternating encoding, decoding, and adding noise
in a particular way.
As discussed in Section 16.8.1, the application of the encoder/decoder pair moves
the input configuration towards a more probable one. This can be exploited to actually
sample from the estimated distribution. If you consider most Markov chain Monte
Carlo (MCMC) algorithms, they have two elements:
1. move from lower probability configurations towards higher probability configura-
tions, and
2. inject randomness so that the chain moves around (and does not stay stuck at
some peak of probability, or mode) and has a chance to visit every configuration
in the whole space, with a relative frequency equal to its probability under the
underlying model.
So conceptually all one needs to do is to perform encode-decode operations (go towards
more probable configurations) as well as inject noise (to move around the probable
configurations), as hinted at in (Mesnil et al., 2012; Rifai et al., 2012).
21.9.1 Markov Chain Associated with any Denoising Auto-Encoder
The above discussion left open the question of what noise to inject and where, in order
to obtain a Markov chain that would generate from the distribution estimated by the
auto-encoder. Bengio et al. (2013b) showed how to construct such a Markov chain for
generalized denoising autoencoders. Generalized denoising autoencoders are specified by
a denoising distribution for sampling an estimate of the clean input given the corrupted
input.
Figure 21.5: Each step of the Markov chain associated with a trained denoising auto-
encoder, that generates the samples from the probabilistic model implicitly trained by
the denoising reconstruction criterion. Each step consists in (a) injecting corruption
C in state x, yielding x̃, (b) encoding it with f, yielding h = f(x̃), (c) decoding the
result with g, yielding parameters ω for the reconstruction distribution, and (d) given
ω, sampling a new state from the reconstruction distribution P(x | ω = g(f(x̃))). In the
typical squared reconstruction error case, g(h) = x̂, which estimates E[x | x̃], corruption
consists in adding Gaussian noise and sampling from P(x | ω) consists in adding another
Gaussian noise to the reconstruction x̂. The latter noise level should correspond to the
mean squared error of reconstructions, whereas the injected noise is a hyper-parameter
that controls the mixing speed as well as the extent to which the estimator smoothes the
empirical distribution (Vincent, 2011b). In the figure, only the C and P conditionals
are stochastic steps (f and g are deterministic computations), although noise can also
be injected inside the auto-encoder, as in generative stochastic networks (Bengio et al.,
2014b)
Each step of the Markov chain that generates from the estimated distribution consists
of the following sub-steps, illustrated in Figure 21.5:

1. Starting from the previous state x, inject corruption noise, sampling x̃ from C(x̃ | x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of P(x | ω = g(h)) = P(x | x̃).

4. Sample the next state x from P(x | ω = g(h)) = P(x | x̃).
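A minimal sketch of this chain (ours; the "trained" denoising auto-encoder is stood in for by untrained linear maps f and g, Gaussian corruption C, and a Gaussian reconstruction distribution whose standard deviation sigma_r would normally be tied to the reconstruction error):

import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4                                  # visible and hidden dimensionality
Wf, Wg = 0.1 * rng.normal(size=(k, d)), 0.1 * rng.normal(size=(d, k))
sigma_c, sigma_r = 0.5, 0.1                   # corruption noise and reconstruction noise levels

def f(x_tilde):                               # encoder (placeholder for a trained one)
    return np.tanh(Wf @ x_tilde)

def g(h):                                     # decoder, outputs omega = E[x | x_tilde]
    return Wg @ h

x = rng.normal(size=d)                        # initial state of the chain
samples = []
for _ in range(1000):
    x_tilde = x + sigma_c * rng.normal(size=d)        # 1. corrupt x via C(x_tilde | x)
    h = f(x_tilde)                                     # 2. encode
    omega = g(h)                                       # 3. decode to reconstruction parameters
    x = omega + sigma_r * rng.normal(size=d)           # 4. sample the next state from P(x | omega)
    samples.append(x)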
The theorem states that if the auto-encoder P(x | x̃) forms a consistent estimator of the
corresponding true conditional distribution, then the stationary distribution of the above
Markov chain forms a consistent estimator (albeit an implicit one) of the data generating
distribution of x.
Figure 21.6: Illustration of one step of the sampling Markov chain associated with a
denoising auto-encoder (see also Figure 21.5). In the figure, the data (black circles) are
sitting near a low-dimensional manifold (a spiral, here), and the two stochastic steps
of the Markov chain are first to corrupt x (clean image of dog, green circle) into x̃
(noisy image of dog, blue circle) via C(x̃ | x) (here an isotropic Gaussian noise in green),
and then to sample a new x via the estimated denoising P(x | x̃). Note how there are
many possible x which could have given rise to x̃, and these all lie on the manifold in
the neighborhood of x̃, hence the flattened shape of P(x | x̃) (in blue). Modified from a
figure first created and graciously authorized by Jason Yosinski.
Figure 21.6 illustrates the sampling process of the DAE in a way that complements
Figure 21.5, with a specific imagined example. For a more elaborate discussion of the
probabilistic nature of denoising auto-encoders, and their generalization (Bengio et al.,
2014b), Generative Stochastic Networks (GSNs), see Section 21.10 below. In particular,
the noise does not have to be injected only in the input, and it could be injected anywhere
along the chain. GSNs also generalize DAEs by allowing the state of the Markov chain
to be extended beyond the visible variable x, to include also some latent variable h.
Finally, Section 21.10 discusses training strategies for DAEs that are aimed at making
it a better generative model and not just a good feature learner.
Figure 21.7: Illustration of the effect of the walk-back training procedure, used for
denoising auto-encoders or GSNs in general. The objective is to remove spurious modes
faster by letting the Markov chain go towards them (along the red path, starting on
the purple data manifold and following the arrows plus noise), and then punishing the
Markov chain for this behavior (i.e., walking back to the right place) by telling the chain
to return towards the data manifold (reconstruct the original data).
21.9.2 Clamping and Conditional Sampling
Similarly to Boltzmann machines, denoising auto-encoders and GSNs can be used to
sample from a conditional distribution P(x_f | x_o), simply by clamping the observed units
x_o and only resampling the free units x_f given x_o and the sampled latent variables (if
any). This has been introduced by Bengio et al. (2014b).
However, note that Proposition 1 of that paper is missing a condition: the transition
operator (defined by the stochastic mapping going from one state of the chain to the
next) should satisfy detailed balance, described in Section 15.1.1.
An experiment in clamping half of the pixels (the right part of the image) and
running the Markov chain on the other half is shown in Figure 21.8.
Figure 21.8: Illustration of clamping the right half of the image and running the Markov
chain by resampling only the left half at each step. These samples come from a GSN trained
to reconstruct MNIST digits at each time step, i.e., using the walkback procedure.
21.9.3 Walk-Back Training Procedure
The walk-back training procedure was proposed by Bengio et al. (2013b) as a way to
speed up the convergence of generative training of denoising auto-encoders. Instead
of performing a one-step encode-decode reconstruction, this procedure consists in al-
ternating multiple stochastic encode-decode steps (as in the generative Markov chain),
initialized at a training example (just like with the contrastive divergence algorithm, de-
scribed in Sections 19.2 and 21.2.1), and penalizing the last probabilistic reconstructions
(or all of the reconstructions along the way).
It was shown in that paper that training with k steps is equivalent (in the sense of
achieving the same stationary distribution) to training with one step, but practically
has the advantage that spurious modes farther from the data can be removed more
efficiently, as illustrated in Figure 21.7.
Figure 21.9 illustrates the application of the walk-back procedure in a generative
stochastic network, which is described in the next section.
Figure 21.9: Left: graphical model of the generative Markov chain associated with a gen-
erative stochastic network (GSN). Right: specific case where the latent variable is formed
of several layers, each connected to the one above and the one below, making the genera-
tive process very similar to Gibbs sampling in a deep Boltzmann machine (Salakhutdinov
and Hinton, 2009b). The walk-back training procedure is used, i.e., at every step the re-
construction probability distribution is pushed towards generating the training example
(which also initializes the chain).
21.10 Generative Stochastic Networks
Generative stochastic networks (Bengio et al., 2014b) or GSNs are generalizations of
denoising auto-encoders that include latent variables in the generative Markov chain, in
addition to the visible variables (usually denoted x). The generative Markov chain looks
like the one in Figure 21.10. An example of a GSN structured like a deep Boltzmann
machine and trained by the walk-back procedure is shown in Figure 21.9.
Figure 21.10: Markov chain of a GSN (generative stochastic network) with latent
variables H and visible variable X, i.e., an unfolding of the generative process with
X_k and H_k at step k of the chain.
A GSN is parametrized by two conditional probability distributions which specify
one step of the Markov chain:
1. P (X_k | H_k) tells how to generate the next visible variable given the current la-
tent state. Such a “reconstruction distribution” is also found in denoising auto-
encoders, RBMs, DBNs and DBMs.
2. P (H_k | H_{k-1}, X_{k-1}) tells how to update the latent state variable, given the
previous latent state and visible variable. (One step of this chain is sketched below.)
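For concreteness, here is a minimal sketch of unrolling such a chain; sample_h and
sample_x are hypothetical stand-ins for draws from P (H_k | H_{k-1}, X_{k-1}) and
P (X_k | H_k) under some trained model, and are not defined elsewhere in this chapter.

    def run_gsn_chain(h0, x0, sample_h, sample_x, n_steps=10):
        # Unroll the generative Markov chain of a GSN.
        # sample_h(h_prev, x_prev) draws from P(H_k | H_{k-1}, X_{k-1}).
        # sample_x(h)              draws from P(X_k | H_k).
        h, x = h0, x0
        samples = []
        for _ in range(n_steps):
            h = sample_h(h, x)   # update the latent state
            x = sample_x(h)      # generate the next visible variable
            samples.append(x)
        return samples           # successive draws approach the stationary distribution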
Denoising auto-encoders and GSNs differ from classical probabilistic models (di-
rected or undirected) in that they parametrize the generative process itself rather than
the mathematical specification of the joint distribution of visible and latent variables.
Instead, the latter is defined implicitly, if it exists, as the stationary distribution of the
generative Markov chain. The conditions for existence of the stationary distribution are
mild (basically, that the chain mixes) but can be violated by some choices of the transition
distributions (for example, if they are deterministic).
One could imagine different training criteria for GSNs. The one proposed and eval-
uated by Bengio et al. (2014b) is simply the reconstruction log-probability on the visible
units, just like for denoising auto-encoders. This is achieved by clamping X_0 = x to the
observed example and maximizing the probability of generating x at some subsequent
time steps, i.e., maximizing log P (X_k = x | H_k), where H_k is sampled from the chain,
given X_0 = x. In order to estimate the gradient of log P (X_k = x | H_k) with respect to
the other pieces of the model, Bengio et al. (2014b) use the reparametrization trick,
introduced in Section 14.5.1.
The walk-back training protocol (described in Section 21.9.3) was used by Bengio et al.
(2014b) to improve the training convergence of GSNs.
21.10.1 Discriminant GSNs
Whereas the original formulation of GSNs (Bengio et al., 2014b) was meant for unsu-
pervised learning and implicitly modeling P (x) for observed data x, it is possible to
modify the framework to optimize P (y|x).
For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only
back-propagating the reconstruction log-probability over the output variables, keeping
the input variables fixed. They applied this successfully to model sequences (protein
secondary structure) and introduced a (one-dimensional) convolutional structure in the
transition operator of the Markov chain. Keep in mind that, for each step of the Markov
chain, one generates a new sequence for each layer, and that sequence is the input for
computing other layer values (say the one below and the one above) at the next time
step, as illustrated in Figure 21.11.
Figure 21.11: Markov chain arising out of a discriminant GSN, i.e., where a GSN is used
as a structured output model over a variable Y, conditioned on an input X. Reproduced
with permission from Zhou and Troyanskaya (2014). The structure is as in a GSN (over
the output) but with computations being conditioned on the input X at each step.
Hence the Markov chain is really over the output variable (and associated higher-
level hidden layers), and the input sequence only serves to condition that chain, with
back-propagation making it possible to learn how the input sequence can condition the
output distribution implicitly represented by the Markov chain. It is therefore a case of
using the GSN in the context of structured outputs, where P (y|x) does not have a simple
parametric form, but instead the components of y are statistically dependent on each
other, given x, in complicated ways.
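As a rough sketch (with hypothetical sample_h and sample_y functions; this is not the
exact architecture of Zhou and Troyanskaya (2014)), the only change relative to the
unconditional chain sketched earlier is that every conditional also receives the fixed
input x:

    def run_conditional_gsn_chain(x, y0, h0, sample_h, sample_y, n_steps=10):
        # Markov chain of a discriminant GSN over the output y, conditioned on x.
        # sample_h(h_prev, y_prev, x) draws the next latent state given the fixed input x.
        # sample_y(h, x)              draws the next output given the latent state and x.
        h, y = h0, y0
        for _ in range(n_steps):
            h = sample_h(h, y, x)   # latent update, conditioned on the input x
            y = sample_y(h, x)      # resample only the output; x stays clamped
        return y                    # approximately a sample from P(y | x)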
Zöhrer and Pernkopf (2014) considered a hybrid model that combines a supervised
objective (as in the above work) and an unsupervised objective (as in the original GSN
work), by simply adding (with a different weight) the supervised and unsupervised costs,
i.e., the reconstruction log-probabilities of y and x respectively. Such a hybrid criterion
had previously been introduced for RBMs by Larochelle and Bengio (2008a). Zöhrer and
Pernkopf report improved classification performance using this scheme.
21.11 Methodological Notes
Researchers studying generative models often need to compare one generative model
to another, usually in order to demonstrate that a newly invented generative model is
better at capturing some distribution than the pre-existing models.
This can be a difficult and subtle task. In many cases, we can not actually evaluate
the log probability of the data under the model, but only an approximation. In these
cases, it’s important to think and communicate clearly about exactly what is being mea-
sured. For example, suppose we can evaluate a stochastic estimate of the log likelihood
for model A, and a deterministic lower bound on the log likelihood for model B. If model
A gets a higher score than model B, which is better? If we care about determining which
model has a better internal representation of the distribution, we actually cannot tell,
unless we have some way of determining how loose the bound for model B is. However,
if we care about how well we can use the model in practice, for example to perform
anomaly detection, then it is fair to say that model A is better, based on a criterion
specific to the practical task of interest, e.g., ranking test examples and evaluating the
ranking with criteria such as precision and recall.
Another subtlety of evaluating generative models is that the evaluation metrics are
often hard research problems in and of themselves. It can be very difficult to establish
that models are being compared fairly. For example, suppose we use AIS to estimate
log Z in order to compute log p̃(x) − log Z for a new model we have just invented. A
computationally economical implementation of AIS may fail to find several modes of the
model distribution and underestimate Z, which will result in us overestimating log p(x).
It can thus be difficult to tell whether a good likelihood estimate is due to a good model
or a bad AIS implementation.
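To make the direction of this bias explicit: writing log p(x) = log p̃(x) − log Z, suppose
the AIS estimate is Ẑ = Z/c for some c > 1 (modes were missed). Then the reported
value is

    log p̃(x) − log Ẑ = log p̃(x) − log Z + log c = log p(x) + log c,

so the log-likelihood is overestimated by exactly log c.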
Other fields of machine learning usually allow for some variation in the preprocessing
of the data. For example, when comparing the accuracy of object recognition algorithms,
it is usually acceptable to preprocess the input images slightly differently for each algo-
rithm based on what kind of input requirements it has. Generative modeling is different
because changes in preprocessing, even very small and subtle ones, are completely
unacceptable. Any change to the input data changes the distribution to be captured and
fundamentally alters the task. For example, multiplying each input value by 0.1
artificially increases the measured density by a factor of 10 for every input dimension.
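More precisely (a simple change-of-variables calculation, assuming a continuous density
over an n-dimensional input): if the data are rescaled as y = 0.1 x, then

    p_Y(y) = 10^n p_X(x), i.e., log p_Y(y) = log p_X(x) + n log 10,

so the log-likelihood of every example is inflated by n log 10 simply because the rescaled
data occupy a smaller volume.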
Issues with preprocessing commonly arise when benchmarking generative models on
the MNIST dataset, one of the more popular generative modeling benchmarks. MNIST
consists of grayscale images. Some models treat MNIST images as points in a real
vector space, while others treat them as binary. Yet others treat the grayscale values
as probabilities for binary samples. It is essential to compare real-valued models
only to other real-valued models and binary-valued models only to other binary-valued
models. Otherwise the likelihoods measured are not on the same space. (For binary-
valued models, the log-likelihood can be at most 0, while for real-valued models it can
be arbitrarily high, since it is the measurement of a density.) Among binary models,
it is important to compare models using exactly the same kind of binarization. For
example, we might binarize a gray pixel to 0 or 1 by thresholding at 0.5, or by drawing
a random sample whose probability of being 1 is given by the gray pixel intensity.
If we use random binarization, we might binarize the whole dataset once, or we
might draw a different random binarization for each step of training and then draw
multiple samples for evaluation. Each of these three schemes yields wildly different
likelihood numbers, and when comparing different models it is important that all of the
models being compared use the same binarization scheme for training and for evaluation.
In fact, researchers who apply a single random binarization step share a file containing
the results of that binarization, so that there is no difference in results based on different
outcomes of the binarization step.
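For concreteness, the three schemes just described might look like this (a NumPy sketch;
what matters for a fair comparison is that all models use the same one):

    import numpy as np

    def binarize_threshold(images):
        # Deterministic binarization: threshold gray values at 0.5.
        return (images > 0.5).astype(np.float32)

    def binarize_once(images, seed=0):
        # Single random binarization: sample once with a fixed seed and reuse
        # the resulting binary dataset for both training and evaluation.
        rng = np.random.RandomState(seed)
        return (rng.uniform(size=images.shape) < images).astype(np.float32)

    def binarize_fresh(images, rng):
        # Fresh random binarization: draw a new sample every call, e.g., once
        # per training step, and several times when evaluating likelihood.
        return (rng.uniform(size=images.shape) < images).astype(np.float32)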
Finally, in some cases the likelihood seems not to measure any attribute of the
model that we really care about. For example, real-valued models of MNIST can obtain
arbitrarily high likelihood by assigning arbitrarily low variance to background pixels
that never change. Models and algorithms that detect these constant features can reap
unlimited rewards, even though this is not a very useful thing to do. This strongly
suggests a need for developing other ways of evaluating generative models.
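As a simple illustration (assuming a Gaussian output distribution on a pixel whose value
is exactly constant and equal to the model mean μ), the per-pixel log density is

    log N(x; μ, σ²) = −(1/2) log(2πσ²) − (x − μ)²/(2σ²) = −(1/2) log(2πσ²),

which grows without bound as σ² → 0, so the total log-likelihood can be made arbitrarily
large without the model learning anything useful about the digits.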
Although this is still an open question, this might be achieved by converting the
problem into a classification task. For example, we have seen that the NCE method
(Noise Contrastive Estimation, Section 19.6) compares the density of the training data
according to a learned unnormalized model with its density under a background model.
However, generative models do not always provide us with an energy function (equiva-
lently, an unnormalized density), e.g., deep Boltzmann machines, generative stochastic
networks, most denoising auto-encoders (that are not guaranteed to correspond to an
energy function), deep belief networks, etc. Therefore, it would be interesting to con-
sider a classification task in which one tries to distinguish the training examples from the
generated examples. This is precisely what is achieved by the discriminator network of
generative adversarial networks (Section 21.8.4). However, this would require an expensive
operation (training a discriminator) each time performance had to be evaluated.
TODO– have a section on sum-product networks including a citation to James
Martens’ recent paper