Chapter 21
Deep Generative Models
In this chapter, we present several of the specific kinds of generative models that can
be built and trained using the techniques presented in chapters 14, 19 and 20. All
of these models represent probability distributions over multiple variables in some way.
Some allow the probability distribution function to be evaluated explicitly. Others do
not allow the evaluation of the probability distribution function, but support operations
that implicitly require knowledge of it, such as sampling. Some of these models are
structured probabilistic models described in terms of graphs and factors, as described
in chapter 14. Others can not easily be described in terms of factors, but represent
probability distributions nonetheless.
21.1 Restricted Boltzmann Machines
Restricted Boltzmann machines are some of the most common building blocks of deep
probabilistic models. They are undirected probabilistic graphical models containing a
layer of observable variables and a single layer of latent variables. RBMs may be stacked
(one on top of the other) to form deeper models. See Fig. 21.1 for some examples. In
particular, Fig. 21.1a shows the graph structure of an RBM itself. It is a bipartite graph,
with no connections permitted between any variables in the observed layer or between
any units in the latent layer.
TODO– review and pointers to other sections of the book This should be the main
place where they are described in detail, earlier they are just an example of undirected
models or an example of a feature learning algorithm.
More formally, we will consider the observed layer to consist of a set of D binary
random variables which we refer to collectively with the vector v, where the ith element
v_i is a binary random variable. We will refer to the latent or hidden layer of N
random variables collectively as h, with the jth element denoted h_j.
The restricted Boltzmann machine is an energy-based model, which means that the
joint probability distribution is fully-specified by its energy function:
P(v, h) = \frac{1}{Z} \exp\{ -E(v, h) \}.
Figure 21.1: Examples of models that may be built with restricted Boltzmann machines.
a) The restricted Boltzmann machine itself is an undirected graphical model based on a
bipartite graph. There are no connections among the visible units, nor any connections
among the hidden units. Typically every visible unit is connected to every hidden unit
but it is possible to construct sparsely connected RBMs such as convolutional RBMs. b)
A deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intra-layer connections. However, a DBN has
multiple hidden layers, and thus there are connections between hidden units that are in
separate layers. All of the local conditional probability distributions needed by the deep
belief network are copied directly from the local conditional probability distributions of
its constituent RBMs. Note that we could also represent the deep belief network with
a completely undirected graph, but it would need intra-layer connections to capture
the dependencies between parents. c) A deep Boltzmann machine is an undirected
graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs
lack intra-layer connections. DBMs are less closely tied to RBMs than DBNs are. When
initializing a DBM from a stack of RBMs, it is necessary to modify the RBM parameters
slightly. Some kinds of DBMs may be trained without first training a set of RBMs.
Here E(v, h) is the energy function that parametrizes the relationship between the
visible and hidden variables:

E(v, h) = -b^\top v - c^\top h - v^\top W h,    (21.1)
and Z is the normalizing constant known as the partition function:

Z = \sum_{v} \sum_{h} \exp\{ -E(v, h) \}.
For many undirected models, it is apparent from the definition of the partition func-
tion Z that the naive method of computing Z (exhaustively summing over all states)
would be computationally intractable. However, it is still possible that a more cleverly
designed algorithm could exploit regularities in the probability distribution to compute
Z faster than the naive algorithm, so the exponential cost of the naive algorithm is not
a guarantee of the partition function’s intractability. In the case of restricted Boltzmann
machines, there is actually a hardness result, proven by Long and Servedio (2010).
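To make the cost of the naive computation concrete, the following sketch (our own illustrative example, not part of the original text; the variable names and shapes are arbitrary choices) computes Z by exhaustive enumeration for a tiny RBM. The double loop visits all 2^D × 2^N joint states, which is only feasible for a handful of units.

import itertools
import numpy as np

def rbm_energy(v, h, b, c, W):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h

def partition_function(b, c, W):
    # Exhaustively sum exp(-E) over all joint binary states.
    # Cost is 2**D * 2**N energy evaluations, hence intractable for large models.
    D, N = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=D):
        for h in itertools.product([0, 1], repeat=N):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), b, c, W))
    return Z

# Tiny example: 4 visible and 3 hidden units (2**7 = 128 states).
rng = np.random.default_rng(0)
D, N = 4, 3
b, c, W = rng.normal(size=D), rng.normal(size=N), rng.normal(size=(D, N))
print(partition_function(b, c, W))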
21.1.1 Conditional Distributions
The intractable partition function Z implies that the joint probability distribution is
also intractable (in the sense that the normalized probability of a given joint configura-
tion of v, h is generally not available). However, due to the bipartite graph structure, the
restricted Boltzmann machine has the very special property that its conditional distri-
butions P (h | v) and P (v | h) are factorial and relatively simple to compute and sample
from. Indeed, it is this property that has made the RBM a relatively popular model
for a wide range of applications including image modeling (CITE), speech processing
(CITE) and natural language processing (CITE).
Deriving the conditional distributions from the joint distribution is straightforward.
p(h | v) = \frac{p(h, v)}{p(v)}

         = \frac{1}{p(v)} \frac{1}{Z} \exp\left\{ b^\top v + c^\top h + v^\top W h \right\}

         = \frac{1}{Z'} \exp\left\{ c^\top h + v^\top W h \right\}

         = \frac{1}{Z'} \exp\left\{ \sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} v^\top W_{:,j} h_j \right\}

         = \frac{1}{Z'} \prod_{j=1}^{n} \exp\left\{ c_j h_j + v^\top W_{:,j} h_j \right\}    (21.2)
Since we are conditioning on the visible units v, we can treat these as constants w.r.t.
the distribution p(h | v). The factorial nature of the conditional p(h | v) follows
immediately from our ability to write the joint probability over the vector h as the
product of (unnormalized) distributions over the individual elements h_j. It is now a
simple matter of normalizing the distributions over the individual binary h_j.
P(h_j = 1 | v) = \frac{\tilde{P}(h_j = 1 | v)}{\tilde{P}(h_j = 0 | v) + \tilde{P}(h_j = 1 | v)}

               = \frac{\exp\left\{ c_j + v^\top W_{:,j} \right\}}{\exp\{0\} + \exp\left\{ c_j + v^\top W_{:,j} \right\}}

               = \mathrm{sigmoid}\left( c_j + v^\top W_{:,j} \right).    (21.3)
We can now express the full conditional over the hidden layer as the factorial distribution:
P(h | v) = \prod_{j=1}^{n} \mathrm{sigmoid}\left( c_j + v^\top W_{:,j} \right).    (21.4)
A similar derivation will show that the other conditional of interest to us, P(v | h), is
also a factorial distribution:

P(v | h) = \prod_{i=1}^{d} \mathrm{sigmoid}\left( b_i + W_{i,:} h \right).    (21.5)
21.1.2 RBM Gibbs Sampling
The factorial nature of these conditionals is a very useful property of the RBM, and allows
us to efficiently draw samples from the joint distribution via a block Gibbs sampling
strategy (see section 15.1 for a more complete discussion of Gibbs sampling methods).
Block Gibbs sampling simply refers to the situation where, in each step of Gibbs
sampling, multiple variables (or a “block” of variables) are sampled jointly. In the case
of the RBM, each iteration of block Gibbs sampling consists of two steps. Step 1:
sample h^{(l)} ∼ P(h | v^{(l)}). Due to the factorial nature of the conditionals, we can
simultaneously and independently sample all of the elements of h^{(l)} given v^{(l)}. Step
2: sample v^{(l+1)} ∼ P(v | h^{(l)}). Again, the factorial nature of the conditional P(v | h^{(l)})
allows us to simultaneously and independently sample all of the elements of v^{(l+1)}
given h^{(l)}.
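As a concrete illustration, here is a minimal numpy sketch of ours for one block Gibbs step (the shapes are an assumption of this example: W is taken to be D × N, so that the products below match the conditionals in Eqs. 21.4 and 21.5).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, b, c, W, rng):
    # Step 1: sample h ~ P(h | v); all hidden units are sampled independently, in parallel.
    p_h = sigmoid(c + v @ W)              # Eq. 21.4
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Step 2: sample v' ~ P(v | h); all visible units are sampled independently, in parallel.
    p_v = sigmoid(b + W @ h)              # Eq. 21.5
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
D, N = 6, 4
b, c, W = np.zeros(D), np.zeros(N), 0.1 * rng.normal(size=(D, N))
v = rng.integers(0, 2, size=D).astype(float)
for _ in range(100):                      # run the chain for a while
    v, h = gibbs_step(v, b, c, W, rng)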
21.2 Training Restricted Boltzmann Machines
Despite the simplicity of the RBM conditionals, training these models is not without
its complications. As a probabilistic model, a sensible inductive principle for estimating
the model parameters is maximum likelihood, though other inductive principles are certainly
possible (Marlin et al., 2010) and will be discussed later in Sec. 21.2.3. In the following
we derive the maximum likelihood gradient with respect to the model parameters.
Let us consider that we have a batch (or minibatch) of n examples taken from an i.i.d.
dataset (independently and identically distributed examples) \{ v^{(1)}, \ldots, v^{(t)}, \ldots, v^{(n)} \}.
The log likelihood under the RBM with parameters b (visible unit biases), c (hidden
unit biases) and W (interaction weights) is given by:
\ell(W, b, c) = \sum_{t=1}^{n} \log P(v^{(t)})

             = \sum_{t=1}^{n} \log \sum_{h} P(v^{(t)}, h)

             = \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \log Z

             = \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \log \sum_{v,h} \exp\{ -E(v, h) \}    (21.6)
In the last line of the equation above, we have used the definition of the partition
function.
To maximize the likelihood of the data under the restricted Boltzmann machine, we
consider the gradient of the likelihood with respect to the model parameters, which we
will refer to collectively as θ = {b, c, W }:
\nabla_\theta \ell(\theta) = \nabla_\theta \sum_{t=1}^{n} \log \left( \sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \right) - n \nabla_\theta \log \sum_{v,h} \exp\{ -E(v, h) \}

 = -\sum_{t=1}^{n} \frac{\sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\} \nabla_\theta E(v^{(t)}, h)}{\sum_{h} \exp\left\{ -E(v^{(t)}, h) \right\}} + n \frac{\sum_{v,h} \exp\{ -E(v, h) \} \nabla_\theta E(v, h)}{\sum_{v,h} \exp\{ -E(v, h) \}}

 = -\sum_{t=1}^{n} \mathbb{E}_{P(h | v^{(t)})} \left[ \nabla_\theta E(v^{(t)}, h) \right] + n \, \mathbb{E}_{P(v,h)} \left[ \nabla_\theta E(v, h) \right]    (21.7)
As we can see from Eq. 21.7, the gradient of the log likelihood is specified as the difference
between two expectations of the gradient of the energy function. The first expectation
(the data term) is with respect to the product of the empirical distribution over the data,
\hat{P}(v) = \frac{1}{n} \sum_{t=1}^{n} \delta(v - v^{(t)}),^1 and the conditional distribution P(h | v^{(t)}). The second
expectation (the model term) is with respect to the joint model distribution P(v, h).
This difference between a data-driven term and a model-driven term is not unique to
RBMs; as discussed in some detail in Sec. 19.2, it is a general feature of the maximum
likelihood gradient for all undirected models.
We can complete the derivation of the log-likelihood gradient by expanding the term
\nabla_\theta E(v, h). We will consider first the gradient of the negative energy function with
respect to W.

^1 As discussed in Sec. 3.10.4, we use the term empirical distribution to refer to a mixture over delta
functions placed on training examples.
\nabla_{W} \left( -E(v, h) \right) = \nabla_{W} \left( b^\top v + c^\top h + v^\top W h \right) = h v^\top    (21.8)
The gradients with respect to b and c are similarly derived:

\nabla_{b} \left( -E(v, h) \right) = v, \qquad \nabla_{c} \left( -E(v, h) \right) = h    (21.9)
Putting it all together, we obtain the following expressions for the gradients with respect
to the RBM parameters, given n training examples:

\nabla_{W} \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} v^{(t)\top} - n \, \mathbb{E}_{P(v,h)} \left[ h v^\top \right]

\nabla_{b} \ell(W, b, c) = \sum_{t=1}^{n} v^{(t)} - n \, \mathbb{E}_{P(v,h)} [v]

\nabla_{c} \ell(W, b, c) = \sum_{t=1}^{n} \hat{h}^{(t)} - n \, \mathbb{E}_{P(v,h)} [h]
where we have defined \hat{h}^{(t)} as

\hat{h}^{(t)} = \mathbb{E}_{P(h | v^{(t)})} [h] = \mathrm{sigmoid}\left( c + v^{(t)\top} W \right).    (21.10)
While we are able to write down these expressions for the log-likelihood gradient,
unfortunately, in most situations of interest, we are not able to use them directly to
calculate gradients. The problem is the expectations over the joint model distribution
P (v, h). While we have conditional distributions P(v | h) and P (h | v) that are easy
to work with, the RBM joint distribution is not amenable to analytic evaluation of the
expectation \mathbb{E}_{P(v,h)} [f(v, h)].
This is bad news: it implies that in most cases it is impractical to compute the exact
log-likelihood gradient. Fortunately, as discussed in Sec. 19.2, there are two widely
used approximation strategies that have been applied to the training of RBM with some
degree of success: contrastive divergence and stochastic maximum likelihood.
In the following sections we discuss two different strategies to approximate this gra-
dient that have been applied to training the RBM. However, before getting into the
actual training algorithms, it is worth considering what general approaches are available
to us in approximating the log-likelihood gradient. As we mentioned, our problem stems
from the expectation over the joint distribution P (v, h), but we know that we have ac-
cess to factorial conditionals and that we can use these as the basis of a Gibbs sampling
procedure to recover samples from the joint distribution (as discussed in Sec. 21.1.2).
Thus, we can imagine using, for example, T MCMC samples from P (v, h) to form a
Monte Carlo estimate of the expectations over the joint distribution:
\mathbb{E}_{P(v,h)} [f(v, h)] \approx \frac{1}{T} \sum_{t=1}^{T} f(v^{(t)}, h^{(t)}).    (21.11)
There is a problem with this strategy that has to do with the initialization of the MCMC
chain. MCMC chains typically require a burn-in period, where the chain must be run for
some number of steps from its initial state before it begins to produce samples that are
close to its equilibrium distribution.
21.2.1 Contrastive Divergence Training of the RBM
As discussed in a more general context in Sec. 19.2, contrastive divergence (CD) seeks
to approximate the expectation over the joint distribution with samples drawn from
short Gibbs sampling chains. CD deals with the typical requirement for an extended
burn-in sample sequence by initializing these chains at the data points used in the
data-dependent, conditional term. The result is a biased approximation of the log-
likelihood gradient (Carreira-Perpiñán and Hinton, 2005; Bengio and Delalleau, 2009;
Fischer and Igel, 2011) that has nevertheless been empirically shown to be effective.
The contrastive divergence algorithm, as applied to RBMs, is given in Algorithm 21.1.
21.2.2 Stochastic Maximum Likelihood (Persistent Contrastive Diver-
gence) for the RBM
While contrastive divergence has been the most popular method of training RBMs,
the stochastic maximum likelihood (SML) algorithm (Younes, 1998; Tieleman, 2008) is
known to be a competitive alternative – especially if we are interested in recovering the
best possible generative model (i.e. achieving the highest possible test set likelihood).
As with CD, the general SML algorithm is described in Sec. 19.2. Here we are concerned
with how to apply the algorithm to training an RBM.
In comparison to the CD algorithm, SML uses an alternative solution to the prob-
lem of how to approximate the partition function’s contribution to the log-likelihood
gradient. Instead of initializing the k-step MCMC chain with the current example from
the training set, in SML we initialize the MCMC chain for training iteration s with the
last state of the MCMC chain from the previous training iteration (s − 1). Assuming that
the gradient updates to the model parameters do not significantly change the model,
the MCMC state of the last iteration should be close to the equilibrium distribution at
iteration s, minimizing the number of “burn-in” MCMC steps needed to reach equilib-
rium at the current iteration. As with CD, in practice we often use just one Gibbs step
between learning iterations. Algorithm 21.2 describes the SML algorithm as applied to
RBMs.
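To make the idea of persistent chains concrete, here is a minimal numpy sketch of SML/PCD updates (an illustrative example of ours, not the book's reference implementation; it assumes binary data in an array `data` of shape (n, D) and uses a single Gibbs step per update, i.e. k = 1).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, D, N, eps = 1000, 16, 8, 0.01
data = (rng.random((n, D)) < 0.5).astype(float)      # placeholder binary dataset
W, b, c = 0.01 * rng.normal(size=(D, N)), np.zeros(D), np.zeros(N)

m = 100                                               # minibatch size / number of fantasy particles
v_model = (rng.random((m, D)) < 0.5).astype(float)    # persistent chains, initialized randomly

for step in range(500):
    v_data = data[rng.integers(0, n, size=m)]
    h_data = sigmoid(v_data @ W + c)                  # positive phase uses E[h | v] (Eq. 21.10)

    # Negative phase: continue the persistent chains from their previous state
    # rather than restarting them at the data (the key difference from CD).
    h_model = (rng.random((m, N)) < sigmoid(v_model @ W + c)).astype(float)
    v_model = (rng.random((m, D)) < sigmoid(h_model @ W.T + b)).astype(float)
    h_model_mean = sigmoid(v_model @ W + c)

    W += eps * (v_data.T @ h_data - v_model.T @ h_model_mean) / m
    b += eps * (v_data - v_model).mean(axis=0)
    c += eps * (h_data - h_model_mean).mean(axis=0)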
TODO: include experimental examples, i.e. an RBM trained with CD on MNIST
21.2.3 Other Inductive Principles
TODO:Other inductive principles have been used to train RBMs. In this section we
briefly discuss these.
Algorithm 21.1 The contrastive divergence algorithm, using gradient ascent as the
optimization procedure.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v; θ)
  to mix when initialized from p_data. Perhaps 1 to 20 to train an RBM on a small image
  patch.
  while not converged do
    Sample a minibatch of m examples from the training set {v^{(1)}, . . . , v^{(m)}}.
    ∆W ← (1/m) Σ_{t=1}^{m} ĥ^{(t)} v^{(t)⊤}
    ∆b ← (1/m) Σ_{t=1}^{m} v^{(t)}
    ∆c ← (1/m) Σ_{t=1}^{m} ĥ^{(t)}
    for t = 1 to m do
      ṽ^{(t)} ← v^{(t)}
    end for
    for l = 1 to k do
      for t = 1 to m do
        h̃^{(t)} sampled from Π_{j=1}^{n} sigmoid(c_j + ṽ^{(t)⊤} W_{:,j}).
        ṽ^{(t)} sampled from Π_{i=1}^{d} sigmoid(b_i + W_{i,:} h̃^{(t)}).
      end for
    end for
    h̄^{(t)} ← sigmoid(c + ṽ^{(t)⊤} W), for t = 1 to m
    ∆W ← ∆W − (1/m) Σ_{t=1}^{m} h̄^{(t)} ṽ^{(t)⊤}
    ∆b ← ∆b − (1/m) Σ_{t=1}^{m} ṽ^{(t)}
    ∆c ← ∆c − (1/m) Σ_{t=1}^{m} h̄^{(t)}
    W ← W + ε ∆W
    b ← b + ε ∆b
    c ← c + ε ∆c
  end while
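A numpy rendering of one CD-k parameter update, following the structure of Algorithm 21.1, may also be helpful. This is our own illustrative sketch; the variable names and shapes are assumptions of the example (`v_batch` holds m binary examples of dimension D, and W has shape D × N).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_batch, W, b, c, k, eps, rng):
    """One contrastive divergence update of (W, b, c), in the spirit of Algorithm 21.1."""
    m = v_batch.shape[0]
    h_hat = sigmoid(v_batch @ W + c)                    # E[h | v] for the data (positive phase)

    v_tilde = v_batch.copy()                            # chains are initialized at the data
    for _ in range(k):                                  # k steps of block Gibbs sampling
        h_tilde = (rng.random(h_hat.shape) < sigmoid(v_tilde @ W + c)).astype(float)
        v_tilde = (rng.random(v_batch.shape) < sigmoid(h_tilde @ W.T + b)).astype(float)
    h_bar = sigmoid(v_tilde @ W + c)                    # E[h | v~] for the samples (negative phase)

    W += eps * (v_batch.T @ h_hat - v_tilde.T @ h_bar) / m
    b += eps * (v_batch - v_tilde).mean(axis=0)
    c += eps * (h_hat - h_bar).mean(axis=0)
    return W, b, c

# Example usage on random binary data.
rng = np.random.default_rng(0)
m, D, N = 64, 16, 8
W, b, c = 0.01 * rng.normal(size=(D, N)), np.zeros(D), np.zeros(N)
v_batch = (rng.random((m, D)) < 0.5).astype(float)
W, b, c = cd_k_update(v_batch, W, b, c, k=1, eps=0.05, rng=rng)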
21.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first successful non-convolutional architec-
tures. The introduction of deep belief networks in 2006 began the current deep learning
renaissance. Prior to the introduction of deep belief networks, deep models were con-
sidered too difficult to optimize, due to the vanishing and exploding gradient problems
and the existence of plateaus, negative curvature, and suboptimal local minima that can
arise in neural network objective functions. Kernel machines with convex objective func-
tions dominated the research landscape. Deep belief networks demonstrated that deep
architectures can be successful, by outperforming kernelized support vector machines
on the MNIST dataset (Hinton et al., 2006). Today, deep belief networks have mostly
fallen out of favor and are rarely used, even compared to other unsupervised or generative
learning algorithms, but they are still deservedly recognized for their important role in
deep learning history.
Algorithm 21.2 The stochastic maximum likelihood / persistent contrastive divergence
algorithm for training an RBM.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v, h; θ + ε∆θ)
  to burn in, starting from samples from p(v, h; θ). Perhaps 1 for an RBM on a small
  image patch.
  Initialize a set of m samples {ṽ^{(1)}, . . . , ṽ^{(m)}} to random values (e.g., from a uniform or
  normal distribution, or possibly a distribution with marginals matched to the model’s
  marginals).
  while not converged do
    Sample a minibatch of m examples {v^{(1)}, . . . , v^{(m)}} from the training set.
    ∆W ← (1/m) Σ_{t=1}^{m} ĥ^{(t)} v^{(t)⊤}
    ∆b ← (1/m) Σ_{t=1}^{m} v^{(t)}
    ∆c ← (1/m) Σ_{t=1}^{m} ĥ^{(t)}
    for l = 1 to k do
      for t = 1 to m do
        h̃^{(t)} sampled from Π_{j=1}^{n} sigmoid(c_j + ṽ^{(t)⊤} W_{:,j}).
        ṽ^{(t)} sampled from Π_{i=1}^{d} sigmoid(b_i + W_{i,:} h̃^{(t)}).
      end for
    end for
    h̄^{(t)} ← sigmoid(c + ṽ^{(t)⊤} W), for t = 1 to m
    ∆W ← ∆W − (1/m) Σ_{t=1}^{m} h̄^{(t)} ṽ^{(t)⊤}
    ∆b ← ∆b − (1/m) Σ_{t=1}^{m} ṽ^{(t)}
    ∆c ← ∆c − (1/m) Σ_{t=1}^{m} h̄^{(t)}
    W ← W + ε ∆W,  b ← b + ε ∆b,  c ← c + ε ∆c
  end while
Deep belief networks are generative models with several layers of latent variables.
The latent variables are typically binary, and the visible units may be binary or real.
There are no intra-layer connections. Usually, every unit in each layer is connected to
every unit in each neighboring layer, though it is possible to construct more sparsely
connected DBNs. The connections between the top two layers are undirected. The
connections between all other layers are directed, with the arrows pointed toward the
layer that is closest to the data. See Fig. 21.1b for an example.
A DBN with L hidden layers contains L weight matrices: W^{(1)}, \ldots, W^{(L)}. It also
contains L + 1 bias vectors: b^{(0)}, \ldots, b^{(L)}, with b^{(0)} providing the biases for the visible
layer. The probability distribution represented by the DBN is given by
p(h^{(L)}, h^{(L-1)}) \propto \exp\left( b^{(L)\top} h^{(L)} + b^{(L-1)\top} h^{(L-1)} + h^{(L-1)\top} W^{(L)} h^{(L)} \right),

p(h^{(l)}_i = 1 \mid h^{(l+1)}) = \sigma\left( b^{(l)}_i + W^{(l+1)\top}_{:,i} h^{(l+1)} \right) \; \forall i, \; \forall l \in 1, \ldots, L - 2,

p(v_i = 1 \mid h^{(1)}) = \sigma\left( b^{(0)}_i + W^{(1)\top}_{:,i} h^{(1)} \right) \; \forall i.
In the case of real-valued visible units, substitute

v \sim \mathcal{N}\left( v \mid b^{(0)} + W^{(1)\top} h^{(1)}, \, \beta^{-1} \right)

with β diagonal for tractability. Generalizations to other exponential family visible units
are straightforward, at least in theory. Note that a DBN with only one hidden layer is
just an RBM.
To generate a sample from a DBN, we first run several steps of Gibbs sampling on
the top two hidden layers. This stage is essentially drawing a sample from the RBM
defined by the top two hidden layers. We can then use a single pass of ancestral sampling
through the rest of the model to draw a sample from the visible units.
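A sketch of this two-stage sampling procedure (our own illustration; the list layout of W and b, and the convention that W[l] connects layer l to layer l + 1 with b[0] the visible biases, are assumptions of this example):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(W, b, n_gibbs, rng):
    """W[l] connects layer l to layer l+1 (W[0]: visible-to-h1); b[l] are the biases of layer l."""
    L = len(W)                                     # number of hidden layers
    # Stage 1: block Gibbs sampling in the top RBM over (h^(L-1), h^(L)).
    h_top = sample_bernoulli(0.5 * np.ones(W[-1].shape[1]), rng)
    for _ in range(n_gibbs):
        h_below = sample_bernoulli(sigmoid(b[L - 1] + W[-1] @ h_top), rng)
        h_top = sample_bernoulli(sigmoid(b[L] + W[-1].T @ h_below), rng)
    # Stage 2: a single pass of ancestral (top-down) sampling through the directed layers.
    h = h_below
    for l in range(L - 2, -1, -1):                 # layers L-2, ..., 1, then the visible layer
        h = sample_bernoulli(sigmoid(b[l] + W[l] @ h), rng)
    return h                                       # a sample of the visible units

# Example: a DBN with layer sizes 6 (visible), 5, 4.
rng = np.random.default_rng(0)
sizes = [6, 5, 4]
W = [0.1 * rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(len(sizes) - 1)]
b = [np.zeros(s) for s in sizes]
v = sample_dbn(W, b, n_gibbs=100, rng=rng)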
Inference in a deep belief network is intractable due to the explaining away effect
within each directed layer, and due to the interaction between the two final hidden layers.
Evaluating or maximizing the standard evidence lower bound on the log likelihood is also
intractable, because the evidence lower bound takes the expectation of cliques whose
size is equal to the network width.
Evaluating or maximizing the log likelihood requires not just confronting the problem
of intractable inference to marginalize out the latent variables, but also the problem of
an intractable partition function within the undirected model of the last two layers.
As a hybrid of directed and undirected models, deep belief networks encounter many
of the difficulties associated with both families of models. Because deep belief networks
are partially undirected, they require Markov chains for sampling and have an intractable
partition function. Because they are directed and generally consist of binary random
variables, their evidence lower bound is intractable.
TODO–training procedure TODO–discriminative fine-tuning TODO–view of MLP
as variational inference with very loose bound comment on how this does not capture
intra-layer explaining away interactions comment on how this does not capture inter-
layer feedback interactions TODO–quantitative analysis with AIS TODO–wake sleep?
The term “deep belief network” is commonly used incorrectly to refer to any kind of
deep neural network, even networks without latent variable semantics. The term “deep
belief network” should refer specifically to models with undirected connections in the
deepest layer and directed connections pointing downward between all other pairs of
sequential layers.
The term “deep belief network” may also cause some confusion because the term
“belief network” is sometimes used to refer to purely directed models, while deep belief
networks contain an undirected layer. Deep belief networks also share the acronym DBN
with dynamic Bayesian networks, which are Bayesian networks for representing Markov
chains.
21.4 Deep Boltzmann Machines
A deep Boltzmann machine (DBM) is another kind of deep, generative model (Salakhut-
dinov and Hinton, 2009a). Unlike the deep belief network (DBN), it is an entirely undi-
rected model. Unlike the RBM, the DBM has several layers of latent variables (RBMs
have just one). But like the RBM, within each layer, each of the variables are mutually
independent, conditioned on the variables in the neighboring layers. See Fig. 21.2 for
the graph structure.
Like RBMs and DBNs, DBMs typically contain only binary units – as we assume in
our development of the model – but they may sometimes contain real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution
over the model variables is parametrized by an energy function E. In the case of a deep
Boltzmann machine with one visible layer, v, and three hidden layers, h^{(1)}, h^{(2)} and h^{(3)},
the joint probability is given by:
P\left( v, h^{(1)}, h^{(2)}, h^{(3)} \right) = \frac{1}{Z(\theta)} \exp\left( -E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) \right).    (21.12)
The DBM energy function is:
E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} - h^{(2)\top} W^{(3)} h^{(3)}.    (21.13)
In comparison to the RBM energy function (Eq. 21.1), the DBM energy function in-
cludes connections between the hidden units (latent variables) in the form of the weight
matrices (W^{(2)} and W^{(3)}). As we will see, these connections have significant conse-
quences for both the model behavior as well as how we go about performing inference
in the model.
In comparison to fully connected Boltzmann machines (with every unit connected
to every other unit), the DBM offers advantages similar to those offered by the RBM.
Specifically, as illustrated in Fig. (TODO: include figure), the DBM layers can be
organized into a bipartite graph, with odd layers on one side and even layers on the
other. This immediately implies that when we condition on the variables in the even
layers, the variables in the odd layers become conditionally independent. Of course,
when we condition on the variables in the odd layers, the variables in the even layers
also become conditionally independent.
We show this explicitly for the conditional distribution P(h^{(1)} \mid v, h^{(2)}), in the
case of a DBM with two hidden layers (of course, this result generalizes to a DBM with
Figure 21.2: The deep Boltzmann machine (biases on all units are present but suppressed
to simplify notation).
any number of layers).
P(h^{(1)} \mid v, h^{(2)}) = \frac{P(h^{(1)}, v, h^{(2)})}{P(v, h^{(2)})}

= \frac{\exp\left( v^\top W^{(1)} h^{(1)} + h^{(1)\top} W^{(2)} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \exp\left( v^\top W^{(1)} h^{(1)} + h^{(1)\top} W^{(2)} h^{(2)} \right)}

= \frac{\exp\left( \sum_{j=1}^{n} v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \exp\left( \sum_{j'=1}^{n} v^\top W^{(1)}_{:,j'} h^{(1)}_{j'} + h^{(1)}_{j'} W^{(2)}_{j',:} h^{(2)} \right)}

= \frac{\prod_{j} \exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_1 = 0}^{1} \cdots \sum_{h^{(1)}_n = 0}^{1} \prod_{j'} \exp\left( v^\top W^{(1)}_{:,j'} h^{(1)}_{j'} + h^{(1)}_{j'} W^{(2)}_{j',:} h^{(2)} \right)}

= \prod_{j} \frac{\exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{\sum_{h^{(1)}_j = 0}^{1} \exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}

= \prod_{j} \frac{\exp\left( v^\top W^{(1)}_{:,j} h^{(1)}_j + h^{(1)}_j W^{(2)}_{j,:} h^{(2)} \right)}{1 + \exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}

= \prod_{j} P(h^{(1)}_j \mid v, h^{(2)}).    (21.14)
From the above we can conclude that the conditional distribution for any layer of the
DBM, conditioned on the neighboring layers, is factorial (i.e. all variables in the layer
are conditionally independent). Further, we’ve shown that this conditional distribution
is given by a logistic sigmoid function:
P(h^{(1)}_j = 1 \mid v, h^{(2)}) = \frac{\exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}{1 + \exp\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right)}

= \frac{1}{1 + \exp\left( -v^\top W^{(1)}_{:,j} - W^{(2)}_{j,:} h^{(2)} \right)}

= \mathrm{sigmoid}\left( v^\top W^{(1)}_{:,j} + W^{(2)}_{j,:} h^{(2)} \right).    (21.15)
For the two layer DBM, the conditional distributions of the remaining two layers
(v, h^{(2)}) also factorize. That is, P(v \mid h^{(1)}) = \prod_{i=1}^{d} P(v_i \mid h^{(1)}), where

P(v_i = 1 \mid h^{(1)}) = \mathrm{sigmoid}\left( W^{(1)}_{i,:} h^{(1)} \right).    (21.16)
Also, P(h^{(2)} \mid h^{(1)}) = \prod_{k=1}^{m} P(h^{(2)}_k \mid h^{(1)}), where

P(h^{(2)}_k = 1 \mid h^{(1)}) = \mathrm{sigmoid}\left( h^{(1)\top} W^{(2)}_{:,k} \right).    (21.17)
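Because the conditionals in Eqs. 21.15–21.17 are factorial, one full Gibbs sweep of the two hidden layer DBM can be organized as two block updates: first sample h^{(1)} given its neighbors (v, h^{(2)}), then sample v and h^{(2)} in parallel given h^{(1)}. A minimal sketch of ours (biases omitted, matching the energy in Eq. 21.13; the layer sizes are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bern(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    # Odd layer: h^(1) depends on both of its neighbors (Eq. 21.15).
    h1 = bern(sigmoid(v @ W1 + W2 @ h2), rng)
    # Even layers: v (Eq. 21.16) and h^(2) (Eq. 21.17) are conditionally
    # independent given h^(1), so they can be resampled in parallel.
    v = bern(sigmoid(W1 @ h1), rng)
    h2 = bern(sigmoid(h1 @ W2), rng)
    return v, h1, h2

rng = np.random.default_rng(0)
d, n, m = 6, 5, 4                      # sizes of v, h^(1), h^(2)
W1, W2 = 0.1 * rng.normal(size=(d, n)), 0.1 * rng.normal(size=(n, m))
v, h1, h2 = bern(0.5 * np.ones(d), rng), np.zeros(n), np.zeros(m)
for _ in range(100):
    v, h1, h2 = dbm_gibbs_sweep(v, h1, h2, W1, W2, rng)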
21.4.1 Interesting Properties
TODO: comparison to DBNs TODO: comparison to neuroscience (local learning) “most
biologically plausible” TODO: description of easy mean field TODO: description of
sampling, comparison to general Boltzmann machines,DBNs
21.4.2 Mean Field Inference in the DBM
For the two hidden layer DBM, the conditional distributions P(v \mid h^{(1)}), P(h^{(1)} \mid
v, h^{(2)}), and P(h^{(2)} \mid h^{(1)}) are factorial. However, the posterior distribution over all the
hidden units given the visible units, i.e. P(h^{(1)}, h^{(2)} \mid v), can be complicated. This is, of
course, due to the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these
variables mutually dependent, given an observed v.
So, like the DBN, we are left to seek out methods to approximate the DBM poste-
rior distribution. However, unlike the DBN, the DBM posterior distribution over its
hidden units – while complicated – is easy to approximate with a variational approxi-
mation (as discussed in Sec. 20.1), specifically a mean field approximation. The mean
field approximation is a simple form of variational inference, where we restrict the ap-
proximating distribution to fully factorial distributions. In the context of DBMs, the
mean field equations capture the bidirectional interactions between layers. In this sec-
tion we derive the iterative approximate inference procedure originally introduced in
Salakhutdinov and Hinton (2009a).
In variational approximations to inference, we approach the task of approximating
a particular target distribution – in our case, the posterior distribution over the hidden
units given the visible units – by some reasonably simple family of distributions. In the
case of the mean field approximation, the approximating family is the set of distributions
where the hidden units are conditionally independent.
Let Q(h^{(1)}, h^{(2)} \mid v) be the approximation of P(h^{(1)}, h^{(2)} \mid v). The mean field
assumption implies that

Q(h^{(1)}, h^{(2)} \mid v) = \prod_{j=1}^{n} Q(h^{(1)}_j \mid v) \prod_{k=1}^{m} Q(h^{(2)}_k \mid v).    (21.18)
The mean field approximation attempts to find, for every observation, a member of
this family of distributions that “best fits” the true posterior P(h^{(1)}, h^{(2)} \mid v). By best
fit, we specifically mean that we wish to find the approximation Q that minimizes the
KL-divergence with P, i.e. KL(Q\|P), where:

KL(Q\|P) = \sum_{h} Q(h^{(1)}, h^{(2)} \mid v) \log\left( \frac{Q(h^{(1)}, h^{(2)} \mid v)}{P(h^{(1)}, h^{(2)} \mid v)} \right)    (21.19)
In general, we do not have to provide a parametric form of the approximating distri-
bution beyond enforcing the independence assumptions. The variational approximation
procedure is generally able to recover a functional form of the approximate distribution.
However, in the case of a mean field assumption on binary hidden units (the case we are
considering here) there is no loss of generality by fixing a parametrization of the model
in advance.
We parametrize Q as a product of Bernoulli distributions; that is, we associate the
probability of each element of h^{(1)} with a parameter. Specifically, for
each j \in \{1, \ldots, n\}, \hat{h}^{(1)}_j = Q(h^{(1)}_j = 1), where \hat{h}^{(1)}_j \in [0, 1], and for each k \in \{1, \ldots, m\},
\hat{h}^{(2)}_k = Q(h^{(2)}_k = 1), where \hat{h}^{(2)}_k \in [0, 1]. Thus we have the following approximation to
the posterior:
Q(h^{(1)}, h^{(2)} \mid v) = \prod_{j=1}^{n} Q(h^{(1)}_j \mid v) \prod_{k=1}^{m} Q(h^{(2)}_k \mid v)

= \prod_{j=1}^{n} \left( \hat{h}^{(1)}_j \right)^{h^{(1)}_j} \left( 1 - \hat{h}^{(1)}_j \right)^{\left( 1 - h^{(1)}_j \right)} \times \prod_{k=1}^{m} \left( \hat{h}^{(2)}_k \right)^{h^{(2)}_k} \left( 1 - \hat{h}^{(2)}_k \right)^{\left( 1 - h^{(2)}_k \right)}    (21.20)
Of course, for DBMs with more layers the approximate posterior parametrization can
be extended in the obvious way.
Now that we have specified our family of approximating distributions Q, it remains
to specify a procedure for choosing the member of this family that best fits P. One way to
do this is to explicitly minimize KL(Q\|P) with respect to the variational parameters of
Q. We will approach the selection of Q from a slightly different, but entirely equivalent,
path. Rather than minimize KL(Q\|P), we will maximize the variational lower bound
(or evidence lower bound; see Sec. 20.1), which in the context of the 2-hidden-layer
deep Boltzmann machine is given by:
\mathcal{L}(Q) = \sum_{h^{(1)}, h^{(2)}} Q(h^{(1)}, h^{(2)} \mid v) \log\left( \frac{P(v, h^{(1)}, h^{(2)}; \theta)}{Q(h^{(1)}, h^{(2)} \mid v)} \right)

= -\sum_{h^{(1)}, h^{(2)}} Q(h^{(1)}, h^{(2)} \mid v) \, E(v, h^{(1)}, h^{(2)}; \theta) - \log Z(\theta) + \mathcal{H}(Q)
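Maximizing L(Q) with respect to each Bernoulli parameter while holding the others fixed yields the familiar mean field fixed-point updates (Salakhutdinov and Hinton, 2009a): each update is the expected version of the corresponding Gibbs conditional, with sampled neighbors replaced by their variational expectations. A small numpy sketch of ours for the two hidden layer case (biases omitted, matching Eq. 21.13; the fixed iteration count is an assumption of the example):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iter=25):
    """Return the variational parameters (h1_hat, h2_hat) for a single observation v."""
    n, m = W1.shape[1], W2.shape[1]
    h1_hat, h2_hat = 0.5 * np.ones(n), 0.5 * np.ones(m)
    for _ in range(n_iter):
        # Cf. Eq. 21.15 with h^(2) replaced by its expectation.
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)
        # Cf. Eq. 21.17 with h^(1) replaced by its expectation.
        h2_hat = sigmoid(h1_hat @ W2)
    return h1_hat, h2_hat

rng = np.random.default_rng(0)
d, n, m = 6, 5, 4
W1, W2 = 0.1 * rng.normal(size=(d, n)), 0.1 * rng.normal(size=(n, m))
v = (rng.random(d) < 0.5).astype(float)
h1_hat, h2_hat = mean_field(v, W1, W2)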
21.4.3 Variational Expectation Maximization
TODO Variational EM for DBM
21.4.4 Variational Learning With SML
Because a deep Boltzmann machine contains restricted Boltzmann machines as compo-
nents, the hardness results for computing the partition function and sampling that apply
to restricted Boltzmann machines also apply to deep Boltzmann machines. This means
that evaluating the probability mass function of a Boltzmann machine requires approx-
imate methods such as annealed importance sampling. Likewise, training the model
requires approximations to the gradient of the log partition function. See chapter 19 for
a description of these methods.
The posterior distribution over the hidden units in a deep Boltzmann machine is
intractable, due to the interactions between different hidden layers. This means that
we must use approximate inference during learning. The standard approach is to use
stochastic gradient ascent on the mean field lower bound, as described in chapter 20.
Mean field is incompatible with most of the methods for approximating the gradients
of the log partition function described in chapter 19. Moreover, contrastive divergence
offers no speedup relative to naive MCMC methods when the posterior is intractable, be-
cause sampling the negative phase fantasy particles requires sampling from the posterior.
This means that DBMs are usually trained using stochastic maximum likelihood. The
negative phase samples can be generated simply by running a Gibbs sampling chain that
alternates between sampling the odd-numbered layers and sampling the even-numbered
layers.
Unfortunately, training a DBM using variational learning and SML from a random
initialization usually results in failure. In some cases, the model fails to learn to represent
the distribution adequately. In other cases, the DBM may represent the distribution
well, but with no higher likelihood than could be obtained with just an RBM. Note that
a DBM with very small weights in all but the first layer represents approximately the
same distribution as an RBM.
It is not clear exactly why this happens. When DBMs are initialized from a pre-
trained configuration, training usually succeeds. See section 21.4.5 for details. One
possibility is that it is difficult to coordinate the learning rate of the stochastic gradi-
ent algorithm with the number of Gibbs steps used in the negative phase of stochastic
maximum likelihood. SML relies on the learning rate being small enough relative to
the number of Gibbs steps that the Gibbs chain can mix again after each update to the
model parameters. The distribution represented by the model can change very rapidly
during the earlier parts of training, and this may make it difficult for the negative
chains employed by SML to fully mix. As described in section 21.4.6, multi-prediction
deep Boltzmann machines avoid the potential inaccuracy of SML by training with a dif-
ferent objective function that is less principled but easier to compute. Another possible
explanation for the failure of joint training with mean field and SML is that the Hessian
matrix could be poorly conditioned. This perspective motivates centered deep Boltz-
mann machines, presented in section 21.4.7, which modify the model family in order to
obtain a better conditioned Hessian matrix.
21.4.5 Layerwise Pretraining
The original and most popular method for overcoming the joint training problem of
DBMs is greedy layerwise pretraining. In this method, each layer of the DBM is trained
in isolation as an RBM. The first layer is trained to model the input data. Each subse-
quent RBM is trained to model samples from the previous RBM’s posterior distribution.
After all of the RBMs have been trained in this way, they can be combined to form a
DBM. The DBM may then be trained with PCD. Typically PCD training will only make
Figure 21.3: The deep Boltzmann machine training procedure used to obtain the state of
the art classification accuracy on the MNIST dataset (Srivastava et al., 2014; Salakhut-
dinov and Hinton, 2009a). a) Train an RBM by using CD to approximately maximize
log P(v). b) Train a second RBM that models h^{(1)} and Y by using CD-k to approx-
imately maximize log P(h^{(1)}, Y) where h^{(1)} is drawn from the first RBM’s posterior
conditioned on the data. Increase k from 1 to 20 during learning. c) Combine the two
RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic
maximum likelihood with k = 5. d) Delete Y from the model. Define a new set of
features h^{(1)} and h^{(2)} that are obtained by running mean field inference in the model
lacking Y. Use these features as input to an MLP whose structure is the same as an
additional pass of mean field, with an additional output layer for the estimate of Y.
Initialize the MLP’s weights to be the same as the DBM’s weights. Train the MLP to
approximately maximize log P(Y | v) using stochastic gradient descent and dropout.
Figure reprinted from (Goodfellow et al., 2013b).
a small change in the model’s parameters and its performance as measured by the log
likelihood it assigns to the data, or its ability to classify inputs.
Note that this greedy layerwise training procedure is not just coordinate ascent. It
bears some passing resemblance to coordinate ascent because we optimize one subset
of the parameters at each step. However, in the case of the greedy layerwise training
procedure, we actually use a different objective function at each step.
TODO: details of combining stacked RBMs into a DBM TODO: partial mean field
negative phase
21.4.6 Multi-Prediction Deep Boltzmann Machines
TODO– cite stoyanov TODO
21.4.7 Centered Deep Boltzmann Machines
TODO
This chapter has described the tools needed to fit a very broad class of probabilistic
models. Which tool to use depends on which aspects of the log-likelihood are problem-
atic.
Figure 21.4: TODO caption and label, reference from text Figure reprinted from (Good-
fellow et al., 2013b).
For the simplest distributions p, the log likelihood is tractable, and the model can be
fit with a straightforward application of maximum likelihood estimation and gradient
ascent as described in chapter
In this chapter, we have shown what to do in two different difficult cases. If Z is in-
tractable, then one may still use maximum likelihood estimation via the sampling ap-
proximation techniques described in section 19.2. If p(h | v) is intractable, one may still
train the model using the negative variational free energy rather than the likelihood, as
described in 20.4.
It is also possible that both of these difficulties will arise. An example of this occurs
with the deep Boltzmann machine (Salakhutdinov and Hinton, 2009b), which is essentially
a sequence of RBMs composed together. The model is depicted graphically in Fig. 21.1c.
This model still has the same problem with computing the partition function as the
simpler RBM does. It has also discarded the restricted structure that made P (h | v)
easy to represent in the RBM. The typical way to train the DBM is to minimize the
variational free energy rather than maximize the likelihood. Of course, the variational
free energy still depends on the partition function, so it is necessary to use sampling
techniques to approximate its gradient.
TODO: k-NADE
21.5 Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data, many
applications such as image and audio modeling seem to require the ability to represent
probability distributions over real values. In some cases, it is possible to treat real-
valued data in the interval [0, 1] as representing the expectation of a binary variable
(TODO cite some examples). However, this is not a particularly theoretically satisfying
approach.
21.5.1 Gaussian-Bernoulli RBMs
TODO– cite exponential family harmoniums? TODO– multiple ways of parametrizing
them (citations?)
21.5.2 mcRBMs
TODO–mcRBMs
2
TODO–HMC
21.5.3 mPoT Model
TODO–mPoT
2
The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the ”mc” is not
pronounced like the “Mc” in “McDonald’s.”
21.5.4 Spike and Slab Restricted Boltzmann Machines
Spike and slab restricted Boltzmann machines (Courville et al., 2011) or ssRBMs provide
another means of modeling the covariance structure of real-valued data. Compared to
mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamil-
tonian Monte Carlo methods.
The spike and slab RBM has two sets of hidden units: the spike units h, which
are binary, and the slab units s, which are real-valued. The mean of the visible units
conditioned on the hidden units is given by (h ⊙ s)W^\top. In other words, each column
W_{:,i} defines a component that can appear in the input. The corresponding spike
variable h_i determines whether that component is present at all. The corresponding
slab variable s_i determines the brightness of that component, if it is present. When
a spike variable is active, the corresponding slab variable adds variance to the input
along the axis defined by W_{:,i}. This allows us to model the covariance of the inputs.
Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs
sampling are still applicable. There is no need to invert any matrix.
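As a small illustration of how the spike and slab variables interact, here is a sketch of ours (the shapes and names are made up for the example): h gates each column of W on or off, and s scales the active columns, producing the conditional mean of the visible units.

import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                                    # visible dimension, number of (spike, slab) pairs
W = rng.normal(size=(d, n))                    # each column W[:, i] is a component/template
h = rng.integers(0, 2, size=n).astype(float)   # binary spikes: is component i present?
s = rng.normal(size=n)                         # real-valued slabs: how strongly does it appear?

# Mean of the visible units given the hidden units: (h * s) selects and scales
# the active columns of W, so only components with h_i = 1 contribute.
v_mean = W @ (h * s)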
Gating by the spike variables means that the true marginal distribution over h ⊙ s
is sparse. This is different from sparse coding, where samples from the model “almost
never” (in the measure theoretic sense) contain zeros in the code, and MAP inference is
required to impose sparsity.
The primary disadvantage of the spike and slab restricted Boltzmann machine is
that some settings of the parameters can correspond to a covariance matrix that is not
positive definite. Such a covariance matrix places more unnormalized probability on
values that are farther from the mean, causing the integral over all possible outcomes to
diverge. Generally this issue can be avoided with simple heuristic tricks. There is not yet
any theoretically satisfying solution. Using constrained optimization to explicitly avoid
the regions where the probability is undefined is difficult to do without being overly
conservative and also preventing the model from accessing high-performing regions of
parameter space.
Qualitatively, convolutional variants of the ssRBM produce excellent samples of nat-
ural images. Some examples are shown in Fig. 14.1.
The ssRBM allows for several extensions. Including higher-order interactions and
average-pooling of the slab variables (Courville et al., 2014) enables the model to learn
excellent features for a classifier when labeled data is scarce. Adding a term to the
energy function that prevents the partition function from becoming undefined results
in a sparse coding model, spike and slab sparse coding (Goodfellow et al., 2013c), also
known as S3C.
21.6 Convolutional Boltzmann Machines
As seen in chapter 9, extremely high dimensional inputs such as images place great
strain on the computation, memory, and statistical requirements of machine learning
models. Replacing matrix multiplication by discrete convolution with a small kernel is
the standard way of solving these problems for inputs that have translation invariant
spatial or temporal structure. Desjardins and Bengio (2008) showed that this approach
works well when applied to RBMs.
Deep convolutional networks usually require a pooling operation so that the spatial
size of each successive layer decreases. Feedforward convolutional networks often use
a pooling function such as the maximum of the elements to be pooled. It is unclear
how to generalize this to the setting of energy-based models. We could introduce a
binary pooling unit p over n binary detector units d and enforce p = max_i d_i by setting
the energy function to be ∞ whenever that constraint is violated. This does not scale
well though, as it requires evaluating 2^n different energy configurations to compute the
normalization constant. For a small 3 × 3 pooling region this requires 2^9 = 512 energy
function evaluations per pooling unit!
Lee et al. (2009) developed a solution to this problem called probabilistic max pool-
ing (not to be confused with “stochastic pooling,” which is a technique for implicitly
constructing ensembles of convolutional feedforward networks). The strategy behind
probabilistic max pooling is to constrain the detector units so at most one may be ac-
tive at a time. This means there are only n + 1 total states (one state for each of the
n detector units being on, and an additional state corresponding to all of the detector
units being off). The pooling unit is on if and only if one of the detector units is on.
The state with all units off is assigned energy zero. We can think of this as describing
a model with a single variable that has n + 1 states, or equivalently as a model that has
n + 1 variables and assigns energy ∞ to all but n + 1 joint assignments of those variables.
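Concretely, if z_i denotes the bottom-up input to detector unit d_i (an assumption of this sketch; we take the energy contribution of an active detector to be −z_i), the n + 1 allowed states can be normalized with a single softmax over {0, z_1, ..., z_n}, the zero corresponding to the all-off state. A short illustrative sketch of ours:

import numpy as np

def probabilistic_max_pool(z, rng):
    """z: inputs to the n detector units in one pooling region.
    Returns sampled detector states d (at most one active) and the pooling unit p."""
    logits = np.concatenate(([0.0], z))            # state 0 = all detector units off
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over the n + 1 allowed states
    state = rng.choice(len(probs), p=probs)
    d = np.zeros(len(z))
    if state > 0:
        d[state - 1] = 1.0                         # detector unit `state - 1` is the active one
    p = float(state > 0)                           # pooling unit is on iff some detector is on
    return d, p

rng = np.random.default_rng(0)
d, p = probabilistic_max_pool(np.array([0.5, 2.0, -1.0, 0.3]), rng)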
While efficient, probabilistic max pooling does force the detector units to be mutually
exclusive, which may be a useful regularizing constraint in some contexts or a harmful
limit on model capacity in other contexts. It also does not support overlapping pooling
regions. Overlapping pooling regions are usually required to obtain the best performance
from feedforward convolutional networks, so this constraint probably greatly reduces
the performance of convolutional Boltzmann machines.
Lee et al. (2009) demonstrated that probabilistic max pooling could be used to build
convolutional deep Boltzmann machines.^3 This model is able to perform operations such
as filling in missing portions of its input. However, it has not proven especially useful as
a pretraining strategy for supervised learning, performing similarly to shallow baseline
models introduced by Pinto et al. (2008).
TODO: comment on partition function changing when you change the image size,
boundary issues
21.7 Other Boltzmann Machines
TODO–Conditional Boltzmann machine TODO-RNN-RBM TODO–discriminative Boltz-
mann machine TODO–Heng’s class relevant and irrelevant Boltzmann machines TODO–
Honglak’s recent work
3
The publication describes the model as a ”deep belief network” but because it can be described
as a purely undirected model with tractable layer-wise mean field fixed point updates, it best fits the
definition of a deep Boltzmann machine.
21.8 Directed Generative Nets
TODO: sigmoid belief nets TODO: refer back to DBN TODO sparse coding TODO
deconvolutional nets? TODO refer back to S3C and BSC TODO: NADE will be in
RNN chapter, refer back ot it here make sure k-NADE and multi-NADE are mentioned
somewhere TODO: refer to DARN and NVIL? TODO: Stochastic Feedforward nets
21.8.1 Sigmoid Belief Nets
21.8.2 Variational Autoencoders
The variational autoencoder is model
L (21.21)
TODO
21.8.3 Variational Interpretation of PSD
TODO, develop the explanation of Sec. 9.1 of Bengio et al. (2013c).
21.8.4 Generative Adversarial Networks
Generative adversarial networks (TODO cite) are another kind of generative model
based on differentiable mappings from input noise to samples that resemble the data.
In this sense, they closely resemble variational autoencoders. However, the training
procedure is different, and the generative model is not necessarily coupled with an inference
network. It is theoretically possible to train an inference network using a strategy
similar to the wake-sleep algorithm, but there is no need to infer posterior variables
during training.
Generative adversarial networks are based on game theory. A generator network is
trained to map input noise z to samples x. This function g(z) defines the generative
model. The distribution p(z) is not learned; it is simply fixed to some distribution at the
start of training (usually a very unstructured distribution such as a normal or uniform
distribution). We can think of g(z) as defining a conditional distribution

p(x \mid z) = \mathcal{N}\left( x \mid g(z), \frac{1}{\beta} I \right),

but in all learning rules we take the limit as β → ∞, so we can treat g(z) itself as a sample
and ignore the parametrization of the output distribution.
The generator g is pitted against an adversary: a discriminator network d. The
discriminator network receives data or samples x as input and outputs its estimate
of the probability that x was sampled from the data rather than the model. During
training, d tries to maximize and g tries to minimize a value function measuring the log
probability of d being correct:
g
= arg min
g
max
d
V (g, d)
403
where
v(g, d) = E
xp
data
log d(x) + E
x
P
p
model
log (1 d(x)) .
The optimization of g can be done simply by backpropagating through d then g,
so the learning process requires neither approximate inference nor approximation of a
partition function gradient. In the case where \max_d v(g, d) is convex (such as the case
where optimization is performed directly in the space of probability density functions)
then the procedure is guaranteed to converge and is asymptotically consistent. In prac-
tice, the procedure can be difficult to make work, because it can be difficult to keep d
optimized well enough to provide a good estimate of how to update g at all times.
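A compact sketch of this alternating game on a toy one-dimensional problem (our own illustration using PyTorch; the network sizes, learning rates, and data distribution are arbitrary choices, not from the original text):

import torch
import torch.nn as nn

torch.manual_seed(0)
d = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
g = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))                # generator
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

for step in range(2000):
    x_data = 3.0 + 0.5 * torch.randn(64, 1)     # "data": samples from N(3, 0.5^2)
    z = torch.randn(64, 1)                       # fixed, unstructured noise distribution p(z)
    x_model = g(z)

    # Discriminator ascends v(g, d) = E_data[log d(x)] + E_model[log(1 - d(x))].
    loss_d = -(torch.log(d(x_data)).mean() + torch.log(1 - d(x_model.detach())).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator descends the same value function (backpropagating through d, then g).
    loss_g = torch.log(1 - d(g(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()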
21.9 A Generative View of Autoencoders
Many kinds of autoencoders can be viewed as probabilistic models. Different autoen-
coders can be interpreted as probabilistic models in different ways.
One of the first probabilistic interpretations of autoencoders was the view of denois-
ing autoencoders as energy-based models trained using regularized score matching. See
Sections 16.8.1 and 19.5 for details. Since the early work (Vincent, 2011a) made the
connection with Gaussian RBMs, this gave denoising auto-encoders with a particu-
lar parametrization a generative interpretation (they could be sampled from using the
MCMC sampling techniques for Gaussian RBMs).
The next milestone in connecting auto-encoders with a generative interpretation
came with the work of Rifai et al. (2012). It relied on the view of contractive auto-
encoders as estimators of the tangent of the manifold near which probability concen-
trates, discussed in Section 16.9 (see also Figures 16.9, 18.3). In this context, Rifai
et al. (2012) demonstrated experimentally that good samples could be obtained from a
trained contractive auto-encoder by alternating encoding, decoding, and adding noise
in a particular way.
As discussed in Section 16.8.1, the application of the encoder/decoder pair moves
the input configuration towards a more probable one. This can be exploited to actually
sample from the estimated distribution. If you consider most Markov chain Monte
Carlo (MCMC) algorithms, they have two elements:
1. move from lower probability configurations towards higher probability configura-
tions, and
2. inject randomness so that the chain moves around (and does not stay stuck at
some peak of probability, or mode) and has a chance to visit every configuration
in the whole space, with a relative frequency equal to its probability under the
underlying model.
So conceptually all one needs to do is to perform encode-decode operations (go towards
more probable configurations) as well as inject noise (to move around the probable
configurations), as hinted at in (Mesnil et al., 2012; Rifai et al., 2012).
21.9.1 Markov Chain Associated with any Denoising Auto-Encoder
The above discussion left open the question of what noise to inject and where, in order
to obtain a Markov chain that would generate from the distribution estimated by the
auto-encoder. Bengio et al. (2013b) showed how to construct such a Markov chain for
generalized denoising autoencoders. Generalized denoising autoencoders are specified by
a denoising distribution for sampling an estimate of the clean input given the corrupted
input.
Figure 21.5: Each step of the Markov chain associated with a trained denoising auto-
encoder, that generates the samples from the probabilistic model implicitly trained by
the denoising reconstruction criterion. Each step consists in (a) injecting corruption
C in state x, yielding x̃, (b) encoding it with f, yielding h = f(x̃), (c) decoding the
result with g, yielding parameters ω for the reconstruction distribution, and (d) given
ω, sampling a new state from the reconstruction distribution P(x | ω = g(f(x̃))). In the
typical squared reconstruction error case, g(h) = x̂, which estimates E[x | x̃], corruption
consists in adding Gaussian noise and sampling from P(x | ω) consists in adding another
Gaussian noise to the reconstruction x̂. The latter noise level should correspond to the
mean squared error of reconstructions, whereas the injected noise is a hyper-parameter
that controls the mixing speed as well as the extent to which the estimator smoothes the
empirical distribution (Vincent, 2011b). In the figure, only the C and P conditionals
are stochastic steps (f and g are deterministic computations), although noise can also
be injected inside the auto-encoder, as in generative stochastic networks (Bengio et al.,
2014b)
Each step of the Markov chain that generates from the estimated distribution consists
of the following sub-steps, illustrated in Figure 21.5:

1. Starting from the previous state x, inject corruption noise, sampling x̃ from C(x̃ | x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of P(x | ω = g(h)) = P(x | x̃).

4. Sample the next state x from P(x | ω = g(h)) = P(x | x̃).
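A minimal sketch of this chain (ours; the "trained" denoising auto-encoder is stood in for by untrained linear maps f and g, Gaussian corruption C, and a Gaussian reconstruction distribution whose standard deviation sigma_r would normally be tied to the reconstruction error):

import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4                                  # visible and hidden dimensionality
Wf, Wg = 0.1 * rng.normal(size=(k, d)), 0.1 * rng.normal(size=(d, k))
sigma_c, sigma_r = 0.5, 0.1                   # corruption noise and reconstruction noise levels

def f(x_tilde):                               # encoder (placeholder for a trained one)
    return np.tanh(Wf @ x_tilde)

def g(h):                                     # decoder, outputs omega = E[x | x_tilde]
    return Wg @ h

x = rng.normal(size=d)                        # initial state of the chain
samples = []
for _ in range(1000):
    x_tilde = x + sigma_c * rng.normal(size=d)        # 1. corrupt x via C(x_tilde | x)
    h = f(x_tilde)                                     # 2. encode
    omega = g(h)                                       # 3. decode to reconstruction parameters
    x = omega + sigma_r * rng.normal(size=d)           # 4. sample the next state from P(x | omega)
    samples.append(x)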
The theorem states that if the auto-encoder P(x | x̃) forms a consistent estimator of the
corresponding true conditional distribution, then the stationary distribution of the above
Markov chain forms a consistent estimator (albeit an implicit one) of the data generating
distribution of x.
Figure 21.6: Illustration of one step of the sampling Markov chain associated with a
denoising auto-encoder (see also Figure 21.5). In the figure, the data (black circles) are
sitting near a low-dimensional manifold (a spiral, here), and the two stochastic steps
of the Markov chain are first to corrupt x (clean image of dog, green circle) into x̃
(noisy image of dog, blue circle) via C(x̃ | x) (here an isotropic Gaussian noise in green),
and then to sample a new x via the estimated denoising P(x | x̃). Note how there are
many possible x which could have given rise to x̃, and these all lie on the manifold in
the neighborhood of x̃, hence the flattened shape of P(x | x̃) (in blue). Modified from a
figure first created and graciously authorized by Jason Yosinski.
Figure 21.6 illustrates the sampling process of the DAE in a way that complements
Figure 21.5, with a specific imagined example. For a more elaborate discussion of the
probabilistic nature of denoising auto-encoders, and their generalization (Bengio et al.,
2014b), Generative Stochastic Networks (GSNs), see Section 21.10 below. In particular,
the noise does not have to be injected only in the input, and it could be injected anywhere
along the chain. GSNs also generalize DAEs by allowing the state of the Markov chain
to be extended beyond the visible variable x, to include also some latent variable h.
Finally, Section 21.10 discusses training strategies for DAEs that are aimed at making
it a better generative model and not just a good feature learner.
Figure 21.7: Illustration of the effect of the walk-back training procedure, used for
denoising auto-encoders or GSNs in general. The objective is to remove spurious modes
faster by letting the Markov chain go towards them (along the red path, starting on
the purple data manifold and following the arrows plus noise), and then punishing the
Markov chain for this behavior (i.e., walking back to the right place) by telling the chain
to return towards the data manifold (reconstruct the original data).
21.9.2 Clamping and Conditional Sampling
Similarly to Boltzmann machines, denoising auto-encoders and GSNs can be used to
sample from a conditional distribution P(x_f | x_o), simply by clamping the observed units
x_o and only resampling the free units x_f given x_o and the sampled latent variables (if
any). This has been introduced by Bengio et al. (2014b).
However, note that Proposition 1 of that paper is missing a condition: the transition
operator (defined by the stochastic mapping going from one state of the chain to the
next) should satisfy detailed balance, described in Section 15.1.1.
An experiment in clamping half of the pixels (the right part of the image) and
running the Markov chain on the other half is shown in Figure 21.8.
Figure 21.8: Illustration of clamping the right half of the image and running the Markov
chain by resampling only the left half at each step. These samples come from a GSN trained
to reconstruct MNIST digits at each time step, i.e., using the walkback procedure.
21.9.3 Walk-Back Training Procedure
The walk-back training procedure was proposed by Bengio et al. (2013b) as a way to
speed up the convergence of generative training of denoising auto-encoders. Instead
of performing a one-step encode-decode reconstruction, this procedure consists in al-
ternating multiple stochastic encode-decode steps (as in the generative Markov chain),
initialized at a training example (just like with the contrastive divergence algorithm, de-
scribed in Sections 19.2 and 21.2.1), and penalizing the last probabilistic reconstructions
(or all of the reconstructions along the way).
It was shown in that paper that training with k steps is equivalent (in the sense of
achieving the same stationary distribution) to training with one step, but practically
has the advantage that spurious modes farther from the data can be removed more
efficiently, as illustrated in Figure 21.7.
Figure 21.9 illustrates the application of the walk-back procedure in a generative
stochastic network, which is described in the next section.
Figure 21.9: Left: graphical model of the generative Markov chain associated with a gen-
erative stochastic network (GSN). Right: specific case where the latent variable is formed
of several layers, each connected to the one above and the one below, making the genera-
tive process very similar to Gibbs sampling in a deep Boltzmann machine (Salakhutdinov
and Hinton, 2009b). The walk-back training procedure is used, i.e., at every step the re-
construction probability distribution is pushed towards generating the training example
(which also initializes the chain).
21.10 Generative Stochastic Networks
Generative stochastic networks (Bengio et al., 2014b) or GSNs are generalizations of
denoising auto-encoders that include latent variables in the generative Markov chain, in
addition to the visible variables (usually denoted x). The generative Markov chain looks
like the one in Figure 21.10. An example of a GSN structured like a deep Boltzmann
machine and trained by the walk-back procedure is shown in Figure 21.9.
Figure 21.10: Markov chain of a GSN (generative stochastic network) with latent
variables H and visible variable X, i.e., an unfolding of the generative process with
X_k and H_k at step k of the chain.
A GSN is parametrized by two conditional probability distributions which specify
one step of the Markov chain:
1. P (X_k | H_k) tells how to generate the next visible variable given the current la-
tent state. Such a “reconstruction distribution” is also found in denoising auto-
encoders, RBMs, DBNs and DBMs.
2. P (H_k | H_{k-1}, X_{k-1}) tells how to update the latent state variable, given the
previous latent state and visible variable. (One step of this chain is sketched below.)
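For concreteness, here is a minimal sketch of unrolling such a chain; sample_h and
sample_x are hypothetical stand-ins for draws from P (H_k | H_{k-1}, X_{k-1}) and
P (X_k | H_k) under some trained model, and are not defined elsewhere in this chapter.

    def run_gsn_chain(h0, x0, sample_h, sample_x, n_steps=10):
        # Unroll the generative Markov chain of a GSN.
        # sample_h(h_prev, x_prev) draws from P(H_k | H_{k-1}, X_{k-1}).
        # sample_x(h)              draws from P(X_k | H_k).
        h, x = h0, x0
        samples = []
        for _ in range(n_steps):
            h = sample_h(h, x)   # update the latent state
            x = sample_x(h)      # generate the next visible variable
            samples.append(x)
        return samples           # successive draws approach the stationary distribution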
Denoising auto-encoders and GSNs differ from classical probabilistic models (di-
rected or undirected) in that they parametrize the generative process itself rather than
the mathematical specification of the joint distribution of visible and latent variables.
Instead, the latter is defined implicitly, if it exists, as the stationary distribution of the
generative Markov chain. The conditions for existence of the stationary distribution are
mild (basically, that the chain mixes) but can be violated by some choices of the transition
distributions (for example, if they are deterministic).
One could imagine different training criteria for GSNs. The one proposed and eval-
uated by Bengio et al. (2014b) is simply the reconstruction log-probability on the visible
units, just like for denoising auto-encoders. This is achieved by clamping X_0 = x to the
observed example and maximizing the probability of generating x at some subsequent
time steps, i.e., maximizing log P (X_k = x | H_k), where H_k is sampled from the chain,
given X_0 = x. In order to estimate the gradient of log P (X_k = x | H_k) with respect to
the other pieces of the model, Bengio et al. (2014b) use the reparametrization trick,
introduced in Section 14.5.1.
The walk-back training protocol (described in Section 21.9.3) was used by Bengio et al.
(2014b) to improve the training convergence of GSNs.
21.10.1 Discriminant GSNs
Whereas the original formulation of GSNs (Bengio et al., 2014b) was meant for unsu-
pervised learning and implicitly modeling P (x) for observed data x, it is possible to
modify the framework to optimize P (y|x).
For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only
back-propagating the reconstruction log-probability over the output variables, keeping
the input variables fixed. They applied this successfully to model sequences (protein
secondary structure) and introduced a (one-dimensional) convolutional structure in the
transition operator of the Markov chain. Keep in mind that, for each step of the Markov
chain, one generates a new sequence for each layer, and that sequence is the input for
computing other layer values (say the one below and the one above) at the next time
step, as illustrated in Figure 21.11.
Figure 21.11: Markov chain arising out of a discriminant GSN, i.e., where a GSN is used
as a structured output model over a variable Y, conditioned on an input X. Reproduced
with permission from Zhou and Troyanskaya (2014). The structure is as in a GSN (over
the output) but with computations being conditioned on the input X at each step.
Hence the Markov chain is really over the output variable (and associated higher-
level hidden layers), and the input sequence only serves to condition that chain, with
back-propagation making it possible to learn how the input sequence can condition the
output distribution implicitly represented by the Markov chain. It is therefore a case of
using the GSN in the context of structured outputs, where P (y|x) does not have a simple
parametric form, but instead the components of y are statistically dependent on each
other, given x, in complicated ways.
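As a rough sketch (with hypothetical sample_h and sample_y functions; this is not the
exact architecture of Zhou and Troyanskaya (2014)), the only change relative to the
unconditional chain sketched earlier is that every conditional also receives the fixed
input x:

    def run_conditional_gsn_chain(x, y0, h0, sample_h, sample_y, n_steps=10):
        # Markov chain of a discriminant GSN over the output y, conditioned on x.
        # sample_h(h_prev, y_prev, x) draws the next latent state given the fixed input x.
        # sample_y(h, x)              draws the next output given the latent state and x.
        h, y = h0, y0
        for _ in range(n_steps):
            h = sample_h(h, y, x)   # latent update, conditioned on the input x
            y = sample_y(h, x)      # resample only the output; x stays clamped
        return y                    # approximately a sample from P(y | x)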
Zöhrer and Pernkopf (2014) considered a hybrid model that combines a supervised
objective (as in the above work) and an unsupervised objective (as in the original GSN
work), by simply adding (with a different weight) the supervised and unsupervised costs,
i.e., the reconstruction log-probabilities of y and x respectively. Such a hybrid criterion
had previously been introduced for RBMs by Larochelle and Bengio (2008a). Zöhrer and
Pernkopf report improved classification performance using this scheme.
21.11 Methodological Notes
Researchers studying generative models often need to compare one generative model
to another, usually in order to demonstrate that a newly invented generative model is
better at capturing some distribution than the pre-existing models.
This can be a difficult and subtle task. In many cases, we can not actually evaluate
the log probability of the data under the model, but only an approximation. In these
cases, it’s important to think and communicate clearly about exactly what is being mea-
sured. For example, suppose we can evaluate a stochastic estimate of the log likelihood
for model A, and a deterministic lower bound on the log likelihood for model B. If model
A gets a higher score than model B, which is better? If we care about determining which
model has a better internal representation of the distribution, we actually cannot tell,
unless we have some way of determining how loose the bound for model B is. However,
if we care about how well we can use the model in practice, for example to perform
anomaly detection, then it is fair to say that model A is better, based on a criterion
specific to the practical task of interest, e.g., ranking test examples and evaluating the
ranking with criteria such as precision and recall.
Another subtlety of evaluating generative models is that the evaluation metrics are
often hard research problems in and of themselves. It can be very difficult to establish
that models are being compared fairly. For example, suppose we use AIS to estimate
log Z in order to compute log p̃(x) − log Z for a new model we have just invented. A
computationally economical implementation of AIS may fail to find several modes of the
model distribution and underestimate Z, which will result in us overestimating log p(x).
It can thus be difficult to tell whether a good likelihood estimate is due to a good model
or a bad AIS implementation.
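To make the direction of this bias explicit: writing log p(x) = log p̃(x) − log Z, suppose
the AIS estimate is Ẑ = Z/c for some c > 1 (modes were missed). Then the reported
value is

    log p̃(x) − log Ẑ = log p̃(x) − log Z + log c = log p(x) + log c,

so the log-likelihood is overestimated by exactly log c.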
Other fields of machine learning usually allow for some variation in the preprocessing
of the data. For example, when comparing the accuracy of object recognition algorithms,
it is usually acceptable to preprocess the input images slightly differently for each algo-
rithm based on what kind of input requirements it has. Generative modeling is different
because changes in preprocessing, even very small and subtle ones, are completely
unacceptable. Any change to the input data changes the distribution to be captured and
fundamentally alters the task. For example, multiplying each input value by 0.1
artificially increases the measured density by a factor of 10 for every input dimension.
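More precisely (a simple change-of-variables calculation, assuming a continuous density
over an n-dimensional input): if the data are rescaled as y = 0.1 x, then

    p_Y(y) = 10^n p_X(x), i.e., log p_Y(y) = log p_X(x) + n log 10,

so the log-likelihood of every example is inflated by n log 10 simply because the rescaled
data occupy a smaller volume.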
Issues with preprocessing commonly arise when benchmarking generative models on
the MNIST dataset, one of the more popular generative modeling benchmarks. MNIST
consists of grayscale images. Some models treat MNIST images as points in a real
vector space, while others treat them as binary. Yet others treat the grayscale values
as probabilities for binary samples. It is essential to compare real-valued models
only to other real-valued models and binary-valued models only to other binary-valued
models. Otherwise the likelihoods measured are not on the same space. (For binary-
valued models, the log-likelihood can be at most 0, while for real-valued models it can
be arbitrarily high, since it is the measurement of a density.) Among binary models,
it is important to compare models using exactly the same kind of binarization. For
example, we might binarize a gray pixel to 0 or 1 by thresholding at 0.5, or by drawing
a random sample whose probability of being 1 is given by the gray pixel intensity.
If we use random binarization, we might binarize the whole dataset once, or we
might draw a different random binarization for each step of training and then draw
multiple samples for evaluation. Each of these three schemes yields wildly different
likelihood numbers, and when comparing different models it is important that all of the
models being compared use the same binarization scheme for training and for evaluation.
In fact, researchers who apply a single random binarization step share a file containing
the results of that binarization, so that there is no difference in results based on different
outcomes of the binarization step.
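For concreteness, the three schemes just described might look like this (a NumPy sketch;
what matters for a fair comparison is that all models use the same one):

    import numpy as np

    def binarize_threshold(images):
        # Deterministic binarization: threshold gray values at 0.5.
        return (images > 0.5).astype(np.float32)

    def binarize_once(images, seed=0):
        # Single random binarization: sample once with a fixed seed and reuse
        # the resulting binary dataset for both training and evaluation.
        rng = np.random.RandomState(seed)
        return (rng.uniform(size=images.shape) < images).astype(np.float32)

    def binarize_fresh(images, rng):
        # Fresh random binarization: draw a new sample every call, e.g., once
        # per training step, and several times when evaluating likelihood.
        return (rng.uniform(size=images.shape) < images).astype(np.float32)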
Finally, in some cases the likelihood seems not to measure any attribute of the
model that we really care about. For example, real-valued models of MNIST can obtain
arbitrarily high likelihood by assigning arbitrarily low variance to background pixels
that never change. Models and algorithms that detect these constant features can reap
unlimited rewards, even though this is not a very useful thing to do. This strongly
suggests a need for developing other ways of evaluating generative models.
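As a simple illustration (assuming a Gaussian output distribution on a pixel whose value
is exactly constant and equal to the model mean μ), the per-pixel log density is

    log N(x; μ, σ²) = −(1/2) log(2πσ²) − (x − μ)²/(2σ²) = −(1/2) log(2πσ²),

which grows without bound as σ² → 0, so the total log-likelihood can be made arbitrarily
large without the model learning anything useful about the digits.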
Although this is still an open question, this might be achieved by converting the
problem into a classification task. For example, we have seen that the NCE method
(Noise Contrastive Estimation, Section 19.6) compares the density of the training data
according to a learned unnormalized model with its density under a background model.
However, generative models do not always provide us with an energy function (equiva-
lently, an unnormalized density), e.g., deep Boltzmann machines, generative stochastic
networks, most denoising auto-encoders (that are not guaranteed to correspond to an
energy function), deep belief networks, etc. Therefore, it would be interesting to con-
sider a classification task in which one tries to distinguish the training examples from the
generated examples. This is precisely what is achieved by the discriminator network of
generative adversarial networks (Section 21.8.4). However, this would require an expensive
operation (training a discriminator) each time performance had to be evaluated.
TODO– have a section on sum-product networks including a citation to James
Martens’ recent paper