
Denoising auto-encoders and GSNs differ from classical probabilistic models
(directed or undirected) in that they parametrize the generative process itself
rather than the mathematical specification of the joint distribution of visible and
latent variables. Instead, the latter is defined implicitly, if it exists, as the
stationary distribution of the generative Markov chain. The conditions for the
existence of the stationary distribution are mild (essentially, that the chain
mixes), but they can be violated by some choices of the transition distributions
(for example, if these were deterministic).
One could imagine different training criteria for GSNs. The one proposed and
evaluated by Bengio et al. (2014b) is simply the reconstruction log-probability
on the visible units, just like for denoising auto-encoders. This is achieved by
clamping X_0 = x to the observed example and maximizing the probability of
generating x at some subsequent time steps, i.e., maximizing log P(X_k = x | H_k),
where H_k is sampled from the chain, given X_0 = x. In order to estimate the
gradient of log P(X_k = x | H_k) with respect to the other pieces of the model,
Bengio et al. (2014b) use the reparametrization trick, introduced in Section 13.5.1.
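As a rough illustration of this criterion, the sketch below computes a
single-step (k = 1) version of the loss for a toy model, with Gaussian noise
added to the hidden units so that the reparametrization trick applies. The
encoder/decoder shapes and the Bernoulli output distribution are assumptions
made for the example, not the architecture of Bengio et al. (2014b).

import torch
import torch.nn as nn

encoder = nn.Linear(784, 256)   # produces the deterministic part of H
decoder = nn.Linear(256, 784)   # produces the logits of P(X | H)
noise_std = 0.1

def gsn_step_loss(x):
    # Clamp X_0 = x, then sample H via the reparametrization trick:
    # H = mu(x) + sigma * epsilon, so gradients flow back through mu(x).
    eps = torch.randn(x.shape[0], 256)
    h = torch.tanh(encoder(x)) + noise_std * eps
    # Maximize log P(X = x | H); equivalently, minimize the negative
    # reconstruction log-probability of the clamped example.
    logits = decoder(h)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum") / x.shape[0]

# Example usage with a batch of (binarized) 784-dimensional inputs:
# x_batch = torch.rand(64, 784)
# loss = gsn_step_loss(x_batch); loss.backward()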
The walk-back training protocol (described in Section 20.10.3) was used by Bengio
et al. (2014b) to improve training convergence of GSNs.
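The following sketch, reusing the toy encoder, decoder and noise_std from the
previous sketch, suggests how walk-back modifies the above criterion: the chain
is run for several steps from the clamped example, and the reconstruction
log-probability of the original x is accumulated at every step. This is only an
illustrative approximation of the protocol of Section 20.10.3.

def walk_back_loss(x, k=3):
    total = 0.0
    x_t = x
    for _ in range(k):
        eps = torch.randn(x.shape[0], 256)
        h = torch.tanh(encoder(x_t)) + noise_std * eps
        logits = decoder(h)
        # At every step, train to reconstruct the original clamped x,
        # pulling samples the chain wanders off to back toward the data.
        total = total + nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="sum") / x.shape[0]
        # Continue the chain from a sample of the model's reconstruction
        # distribution (detached: no gradient flows through the resampling).
        x_t = torch.bernoulli(torch.sigmoid(logits)).detach()
    return total / k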
20.11.1 Discriminant GSNs
Whereas the original formulation of GSNs (Bengio et al., 2014b) was meant for
unsupervised learning and implicitly modeling P (x) for observed data x, it is
possible to modify the framework to optimize P (y | x).
For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only
back-propagating the reconstruction log-probability over the output variables,
keeping the input variables fixed. They applied this successfully to model
sequences (protein secondary structure) and introduced a (one-dimensional)
convolutional structure in the transition operator of the Markov chain. Keep in
mind that, for each step of the Markov chain, one generates a new sequence for
each layer, and that sequence is the input for computing other layer values (say
the one below and the one above) at the next time step, as illustrated in
Figure 20.11.
Hence the Markov chain is really over the output variable (and associated
higher-level hidden layers), and the input sequence only serves to condition that
chain, with back-propagation making it possible to learn how the input sequence
conditions the output distribution implicitly represented by the Markov chain. It
is therefore a case of using the GSN in the context of structured outputs, where
P (y | x) does not have a simple parametric form but instead the components of
y are statistically dependent on each other, given x, in complicated ways.
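The sketch below conveys this conditioning structure: the input x is clamped, the
chain state consists only of the output y (plus a hidden layer), and only the
output reconstruction log-probability is back-propagated. The layer sizes, the
single hidden layer and the Bernoulli output distribution are assumptions made
for the example; they are not the convolutional architecture of Zhou and
Troyanskaya (2014).

import torch
import torch.nn as nn

x_dim, y_dim, h_dim = 100, 10, 64
f_in  = nn.Linear(x_dim, h_dim)   # conditions the chain on the fixed input
f_out = nn.Linear(y_dim, h_dim)   # feeds the current output state back in
g     = nn.Linear(h_dim, y_dim)   # produces the logits of P(Y | H)

def discriminant_gsn_loss(x, y, k=3, noise_std=0.1):
    # x: float tensor (batch, x_dim); y: float tensor of 0/1 targets (batch, y_dim).
    total = 0.0
    y_t = torch.zeros_like(y)                  # chain state over the outputs only
    for _ in range(k):
        h = torch.tanh(f_in(x) + f_out(y_t))   # hidden state depends on x and Y_t
        h = h + noise_std * torch.randn_like(h)
        logits = g(h)
        # Only the output variables contribute to the training criterion.
        total = total + nn.functional.binary_cross_entropy_with_logits(
            logits, y, reduction="sum") / y.shape[0]
        # Advance the chain over the outputs; the input x stays clamped.
        y_t = torch.bernoulli(torch.sigmoid(logits)).detach()
    return total / k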
Zöhrer and Pernkopf (2014) considered a hybrid model that combines a supervised
objective (as in the above work) and an unsupervised objective (as in