
(a) Learned Frey Face manifold (b) Learned MNIST manifold

Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables $z$. For each of these values $z$, we plotted the corresponding generative $p_\theta(x|z)$ with the learned parameters $\theta$.

(a) 2-D latent space (b) 5-D latent space (c) 10-D latent space (d) 20-D latent space

Figure 5: Random samples from learned generative models of MNIST for different dimensionalities of latent space.
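The inverse-CDF mapping described in the Figure 4 caption is straightforward to implement. The following is a minimal sketch, not code from the paper: `decode` is a hypothetical stand-in for the trained decoder computing the mean of $p_\theta(x|z)$, and the grid resolution is arbitrary.

```python
import numpy as np
from scipy.stats import norm

def decode(z):
    # Placeholder for a trained decoder network mapping a 2-D code to an image;
    # here it just returns a transform of the code so the sketch runs end to end.
    return np.tanh(z)

# Linearly spaced coordinates on the unit square, avoiding the endpoints 0 and 1 ...
grid = np.linspace(0.05, 0.95, 20)

# ... pushed through the inverse CDF (ppf) of the standard Gaussian, so that the
# resulting latent values z cover the probability mass of the N(0, I) prior.
images = [[decode(norm.ppf([u1, u2])) for u2 in grid] for u1 in grid]
```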
B   Solution of $D_{KL}(q_\phi(z) \| p_\theta(z))$, Gaussian case

The variational lower bound (the objective to be maximized) contains a KL term that can often be integrated analytically. Here we give the solution when both the prior $p_\theta(z) = \mathcal{N}(0, I)$ and the posterior approximation $q_\phi(z|x^{(i)})$ are Gaussian. Let $J$ be the dimensionality of $z$. Let $\mu$ and $\sigma$ denote the variational mean and s.d. evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ simply denote the $j$-th element of these vectors. Then:
\[
\int q_\theta(z) \log p(z) \, dz
  = \int \mathcal{N}(z; \mu, \sigma^2) \log \mathcal{N}(z; 0, I) \, dz
  = -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \left( \mu_j^2 + \sigma_j^2 \right)
\]
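As a numerical sanity check on this identity (an addition, not part of the original appendix), the following NumPy sketch compares the closed form against a Monte Carlo estimate for arbitrary example values of $\mu$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational parameters for one datapoint (arbitrary example values).
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.8, 1.2, 0.5])
J = mu.size

# Closed form of the integral above:
#   int N(z; mu, sigma^2) log N(z; 0, I) dz
#     = -J/2 * log(2*pi) - 1/2 * sum_j (mu_j^2 + sigma_j^2)
analytic = -0.5 * J * np.log(2 * np.pi) - 0.5 * np.sum(mu**2 + sigma**2)

# Monte Carlo check: average log N(z; 0, I) over samples z ~ N(mu, sigma^2 I).
z = mu + sigma * rng.standard_normal((200_000, J))
log_prior = -0.5 * J * np.log(2 * np.pi) - 0.5 * np.sum(z**2, axis=1)
print(analytic, log_prior.mean())  # the two values should closely agree
```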
Figure 13.7: Two-dimensional representation space (for easier visualization), i.e., a Euclidean coordinate system for Frey faces (left) and MNIST digits (right), learned by a variational auto-encoder (Kingma and Welling, 2014). Figures reproduced with permission from the authors. The images shown are not examples from the training set but actually generated by the model, simply by changing the 2-D “code”. On the left, one dimension that has been discovered (horizontal) mostly corresponds to a rotation of the face, while the other (vertical) corresponds to the emotional expression. The decoder deterministically maps codes (here two numbers) to images. The encoder maps images to codes (and adds noise, during training).
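The last two sentences of the caption can be made concrete with a small sketch (illustrative values only; the trained encoder and decoder networks themselves are not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input image: the mean and standard
# deviation of its 2-D code (in a real model these come from a trained network).
mu = np.array([0.3, -1.1])
sigma = np.array([0.2, 0.4])

# During training, the code passed on to the decoder is the mean plus Gaussian
# noise; this is the "adds noise, during training" part of the caption.
z_train = mu + sigma * rng.standard_normal(2)

# The decoder (not shown) then deterministically maps the 2-D code to an image.
# For visualisation, noiseless codes such as z = mu can be decoded instead.
```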
Another kind of interesting illustration of manifold learning involves the discovery of distributed representations for words. Neural language models were initiated with the work of Bengio et al. (2001b, 2003), in which a neural network is trained to predict the next word in a sequence of natural language text, given the previous words, and where each word is represented by a real-valued vector, called an embedding or neural word embedding.
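A minimal sketch of this kind of model is given below. It is not the architecture of Bengio et al.: the sizes are toy values, the parameters are random rather than trained, and only the forward pass is shown (embedding lookup, concatenation, hidden layer, softmax over the next word).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy configuration (illustrative only).
vocab_size, embed_dim, context, hidden = 10_000, 50, 3, 100

# Each word is represented by a learned real-valued vector (its embedding).
embeddings = rng.normal(scale=0.01, size=(vocab_size, embed_dim))
W_h = rng.normal(scale=0.01, size=(context * embed_dim, hidden))
W_out = rng.normal(scale=0.01, size=(hidden, vocab_size))

def next_word_probs(previous_word_ids):
    """Look up the embeddings of the previous words, concatenate them,
    and map them to a distribution over the next word."""
    x = embeddings[previous_word_ids].ravel()   # (context * embed_dim,)
    h = np.tanh(x @ W_h)                        # hidden layer
    logits = h @ W_out                          # one score per vocabulary word
    p = np.exp(logits - logits.max())
    return p / p.sum()                          # softmax over the vocabulary

probs = next_word_probs([12, 7, 404])           # arbitrary word indices
```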
Figure 13.8 shows such neural word embeddings reduced to two dimensions (originally 50 or 100) using the t-SNE non-linear dimensionality reduction algorithm (van der Maaten and Hinton, 2008). The figure zooms into different areas of the word-space and illustrates that words that are semantically and syntactically close end up having nearby embeddings.
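In practice such a 2-D projection can be produced with an off-the-shelf t-SNE implementation; the sketch below uses scikit-learn (an assumption, not a tool named in the text) and random stand-in vectors in place of trained word embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for trained word embeddings: 1,000 words with 50-D vectors.
# (Real embeddings would come from a trained language model.)
embeddings = rng.normal(size=(1000, 50))

# Reduce to two dimensions for plotting.
coords_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
# coords_2d has shape (1000, 2); each row can be plotted and labelled with its word.
```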