
Denoising auto-encoders and GSNs differ from classical probabilistic models
(directed or undirected) in that they parametrize the generative process itself
rather than the mathematical specification of the joint distribution of visible and
latent variables. Instead, the latter is defined implicitly, if it exists, as the
stationary distribution of the generative Markov chain. The conditions for the
existence of the stationary distribution are mild (essentially, that the chain
mixes), but they can be violated by some choices of the transition distributions
(for example, if these were deterministic).
One could imagine different training criteria for GSNs. The one proposed and
evaluated by Bengio et al. (2014b) is simply the reconstruction log-probability
on the visible units, just like for denoising auto-encoders. This is achieved by
clamping X_0 = x to the observed example and maximizing the probability of
generating x at some subsequent time steps, i.e., maximizing log P(X_k = x | H_k),
where H_k is sampled from the chain, given X_0 = x. In order to estimate the
gradient of log P(X_k = x | H_k) with respect to the other pieces of the model,
Bengio et al. (2014b) use the reparametrization trick, introduced in Section 13.5.1.
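As a rough illustration of this criterion, the sketch below computes a
single-step (k = 1) version of the loss for a toy model, with Gaussian noise
added to the hidden units so that the reparametrization trick applies. The
encoder/decoder shapes and the Bernoulli output distribution are assumptions
made for the example, not the architecture of Bengio et al. (2014b).

import torch
import torch.nn as nn

encoder = nn.Linear(784, 256)   # produces the deterministic part of H
decoder = nn.Linear(256, 784)   # produces the logits of P(X | H)
noise_std = 0.1

def gsn_step_loss(x):
    # Clamp X_0 = x, then sample H via the reparametrization trick:
    # H = mu(x) + sigma * epsilon, so gradients flow back through mu(x).
    eps = torch.randn(x.shape[0], 256)
    h = torch.tanh(encoder(x)) + noise_std * eps
    # Maximize log P(X = x | H); equivalently, minimize the negative
    # reconstruction log-probability of the clamped example.
    logits = decoder(h)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="sum") / x.shape[0]

# Example usage with a batch of (binarized) 784-dimensional inputs:
# x_batch = torch.rand(64, 784)
# loss = gsn_step_loss(x_batch); loss.backward()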
The walk-back training protocol (described in Section 20.10.3) was used by Bengio
et al. (2014b) to improve training convergence of GSNs.
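The following sketch, reusing the toy encoder, decoder and noise_std from the
previous sketch, suggests how walk-back modifies the above criterion: the chain
is run for several steps from the clamped example, and the reconstruction
log-probability of the original x is accumulated at every step. This is only an
illustrative approximation of the protocol of Section 20.10.3.

def walk_back_loss(x, k=3):
    total = 0.0
    x_t = x
    for _ in range(k):
        eps = torch.randn(x.shape[0], 256)
        h = torch.tanh(encoder(x_t)) + noise_std * eps
        logits = decoder(h)
        # At every step, train to reconstruct the original clamped x,
        # pulling samples the chain wanders off to back toward the data.
        total = total + nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="sum") / x.shape[0]
        # Continue the chain from a sample of the model's reconstruction
        # distribution (detached: no gradient flows through the resampling).
        x_t = torch.bernoulli(torch.sigmoid(logits)).detach()
    return total / k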
20.11.1 Discriminant GSNs
Whereas the original formulation of GSNs (Bengio et al., 2014b) was meant for
unsupervised learning and implicitly modeling P (x) for observed data x, it is
possible to modify the framework to optimize P (y | x).
For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only
back-propagating the reconstruction log-probability over the output variables,
keeping the input variables fixed. They applied this successfully to model
sequences (protein secondary structure) and introduced a (one-dimensional)
convolutional structure in the transition operator of the Markov chain. Keep in
mind that, for each step of the Markov chain, one generates a new sequence for
each layer, and that sequence is the input for computing other layer values (say
the one below and the one above) at the next time step, as illustrated in
Figure 20.11.
Hence the Markov chain is really over the output variable (and associated
higher-level hidden layers), and the input sequence only serves to condition that
chain, with back-propagation making it possible to learn how the input sequence
conditions the output distribution implicitly represented by the Markov chain. It
is therefore a case of using the GSN in the context of structured outputs, where
P (y | x) does not have a simple parametric form but instead the components of
y are statistically dependent on each other, given x, in complicated ways.
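The sketch below conveys this conditioning structure: the input x is clamped, the
chain state consists only of the output y (plus a hidden layer), and only the
output reconstruction log-probability is back-propagated. The layer sizes, the
single hidden layer and the Bernoulli output distribution are assumptions made
for the example; they are not the convolutional architecture of Zhou and
Troyanskaya (2014).

import torch
import torch.nn as nn

x_dim, y_dim, h_dim = 100, 10, 64
f_in  = nn.Linear(x_dim, h_dim)   # conditions the chain on the fixed input
f_out = nn.Linear(y_dim, h_dim)   # feeds the current output state back in
g     = nn.Linear(h_dim, y_dim)   # produces the logits of P(Y | H)

def discriminant_gsn_loss(x, y, k=3, noise_std=0.1):
    # x: float tensor (batch, x_dim); y: float tensor of 0/1 targets (batch, y_dim).
    total = 0.0
    y_t = torch.zeros_like(y)                  # chain state over the outputs only
    for _ in range(k):
        h = torch.tanh(f_in(x) + f_out(y_t))   # hidden state depends on x and Y_t
        h = h + noise_std * torch.randn_like(h)
        logits = g(h)
        # Only the output variables contribute to the training criterion.
        total = total + nn.functional.binary_cross_entropy_with_logits(
            logits, y, reduction="sum") / y.shape[0]
        # Advance the chain over the outputs; the input x stays clamped.
        y_t = torch.bernoulli(torch.sigmoid(logits)).detach()
    return total / k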
Zöhrer and Pernkopf (2014) considered a hybrid model that combines a supervised
objective (as in the above work) and an unsupervised objective (as in