
has. Generative modeling is different because changes in preprocessing, even very
small and subtle ones, are completely unacceptable. Any change to the input data
changes the distribution to be captured and fundamentally alters the task. For
example, multiplying the input by 0.1 will artificially increase the likelihood by a factor of 10.
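This follows from the change-of-variables formula: for a scalar input rescaled as y = 0.1 x, the density of the rescaled data is p_Y(y) = p_X(y / 0.1) |dx/dy| = 10 p_X(x), so the attainable density at each data point, and hence the measured likelihood, is inflated by a factor of 10 (per dimension, in the multivariate case).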
Issues with preprocessing commonly arise when benchmarking generative models
on the MNIST dataset, one of the more popular generative modeling benchmarks.
MNIST consists of grayscale images. Some models treat MNIST images as points
in a real vector space, while others treat them as binary. Yet others treat the
grayscale values as probabilities for a binary sample to take the value 1. It is essential to compare
real-valued models only to other real-valued models and binary-valued models only
to other binary-valued models. Otherwise the likelihoods measured are not on the
same space. For binary-valued models, the log-likelihood can be at most zero, while for real-valued models it can be arbitrarily high, since it is the measurement of a density.
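To make the contrast concrete: a binary model assigns each image a probability of at most 1, so its log-likelihood satisfies log p(x) <= 0, whereas a density can exceed 1 at a point. For instance, a uniform density on an interval of width 0.1 takes the value 10, so its log-density is positive, and by concentrating its mass further a real-valued model can make the reported log-likelihood arbitrarily large.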
Among binary models, it is important to compare models using exactly the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1
by thresholding at 0.5, or by drawing a random sample whose probability of being
1 is given by the gray pixel intensity. If we use the random binarization, we might
binarize the whole dataset once, or we might draw a different random binarization for each step of training and then draw multiple samples for evaluation. Each of these
three schemes yields wildly different likelihood numbers, and when comparing
different models it is important that both models use the same binarization scheme
for training and for evaluation. In fact, researchers who apply a single random
binarization step share a file containing the results of the random binarization, so
that there is no difference in results based on different outcomes of the binarization
step.
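As a concrete illustration, the three schemes might be implemented as in the following sketch (NumPy-based; the array names and shapes are illustrative placeholders, not part of any benchmark):

import numpy as np

rng = np.random.default_rng(0)
# Placeholder grayscale images with intensities in [0, 1] (stand-in for MNIST).
images = rng.random((4, 784)).astype(np.float32)

# Scheme 1: deterministic binarization by thresholding at 0.5.
binary_threshold = (images >= 0.5).astype(np.float32)

# Scheme 2: a single random binarization, drawn once and then reused for both
# training and evaluation (this is the binarized file that researchers share).
binary_fixed = (rng.random(images.shape) < images).astype(np.float32)

# Scheme 3: treat each pixel intensity as a Bernoulli probability and
# re-binarize on the fly, for example once per training step, averaging over
# several draws at evaluation time.
def sample_binarization(batch, rng):
    return (rng.random(batch.shape) < batch).astype(np.float32)

binary_resampled = sample_binarization(images, rng)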
Because being able to generate realistic samples from the data distribution
is one of the goals of a generative model, practitioners often evaluate generative
models by visually inspecting the samples. In the best case, this is done not by the
researchers themselves, but by experimental subjects who do not know the source
of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor
probabilistic model to produce very good samples. A common practice to verify
whether the model merely copies some of the training examples is illustrated in Fig. 16.1.
The idea is to show for some of the generated samples their nearest neighbor in
the training set, according to Euclidean distance in the space of x. The model can overfit the training set and just reproduce training instances.
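A minimal sketch of this nearest-neighbor check, assuming the generated samples and training images are stored as flattened vectors (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: each row is a flattened image.
train_images = rng.random((1000, 784)).astype(np.float32)
generated = rng.random((5, 784)).astype(np.float32)

# Euclidean distance from every generated sample to every training image.
dists = np.linalg.norm(generated[:, None, :] - train_images[None, :, :], axis=-1)
nearest_idx = dists.argmin(axis=1)
nearest_dist = dists.min(axis=1)

# Near-zero distances suggest the model is simply reproducing training instances.
for i, (j, d) in enumerate(zip(nearest_idx, nearest_dist)):
    print(f"generated sample {i}: nearest training image {j}, distance {d:.3f}")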
It is even possible to simultaneously underfit and overfit yet still produce samples that individually look
good. Imagine a generative model trained on images of dogs and cats that simply
learns to reproduce the training images of dogs. Such a model has clearly overfit,
because it does not produce images that were not in the training set, but it has