
has. Generative modeling is different because changes in preprocessing, even very
small and subtle ones, are completely unacceptable. Any change to the input data
changes the distribution to be captured and fundamentally alters the task. For
example, multiplying the input by 0.1 will artificially increase the likelihood by a factor of 10.
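This follows from the change-of-variables formula: for a scalar input rescaled as y = 0.1 x, the density of the rescaled data is p_Y(y) = p_X(y / 0.1) |dx/dy| = 10 p_X(x), so the attainable density at each data point, and hence the measured likelihood, is inflated by a factor of 10 (per dimension, in the multivariate case).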
Issues with preprocessing commonly arise when benchmarking generative models
on the MNIST dataset, one of the more popular generative modeling benchmarks.
MNIST consists of grayscale images. Some models treat MNIST images as points
in a real vector space, while others treat them as binary. Yet others treat the
grayscale values as probabilities for a binary sample to take the value 1. It is essential to compare
real-valued models only to other real-valued models and binary-valued models only
to other binary-valued models. Otherwise the likelihoods measured are not on the
same space. For binary-valued models, the log-likelihood can be at most zero, while for real-valued models it can be arbitrarily high, since it is the measurement of a density.
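To make the contrast concrete: a binary model assigns each image a probability of at most 1, so its log-likelihood satisfies log p(x) <= 0, whereas a density can exceed 1 at a point. For instance, a uniform density on an interval of width 0.1 takes the value 10, so its log-density is positive, and by concentrating its mass further a real-valued model can make the reported log-likelihood arbitrarily large.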
Among binary models, it is important to compare models using exactly the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1
by thresholding at 0.5, or by drawing a random sample whose probability of being
1 is given by the gray pixel intensity. If we use the random binarization, we might
binarize the whole dataset once, or we might draw a different random binarization for each step of training and then draw multiple samples for evaluation. Each of these
three schemes yields wildly different likelihood numbers, and when comparing
different models it is important that both models use the same binarization scheme
for training and for evaluation. In fact, researchers who apply a single random
binarization step share a file containing the results of the random binarization, so
that there is no difference in results based on different outcomes of the binarization
step.
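As a concrete illustration, the three schemes might be implemented as in the following sketch (NumPy-based; the array names and shapes are illustrative placeholders, not part of any benchmark):

import numpy as np

rng = np.random.default_rng(0)
# Placeholder grayscale images with intensities in [0, 1] (stand-in for MNIST).
images = rng.random((4, 784)).astype(np.float32)

# Scheme 1: deterministic binarization by thresholding at 0.5.
binary_threshold = (images >= 0.5).astype(np.float32)

# Scheme 2: a single random binarization, drawn once and then reused for both
# training and evaluation (this is the binarized file that researchers share).
binary_fixed = (rng.random(images.shape) < images).astype(np.float32)

# Scheme 3: treat each pixel intensity as a Bernoulli probability and
# re-binarize on the fly, for example once per training step, averaging over
# several draws at evaluation time.
def sample_binarization(batch, rng):
    return (rng.random(batch.shape) < batch).astype(np.float32)

binary_resampled = sample_binarization(images, rng)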
Because being able to generate realistic samples from the data distribution
is one of the goals of a generative model, practitioners often evaluate generative
models by visually inspecting the samples. In the best case, this is done not by the
researchers themselves, but by experimental subjects who do not know the source
of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor
probabilistic model to produce very good samples. A common practice to verify
whether the model merely copies some of the training examples is illustrated in Fig. 16.1.
The idea is to show for some of the generated samples their nearest neighbor in
the training set, according to Euclidean distance in the space of x. The model can overfit the training set and just reproduce training instances.
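A minimal sketch of this nearest-neighbor check, assuming the generated samples and training images are stored as flattened vectors (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: each row is a flattened image.
train_images = rng.random((1000, 784)).astype(np.float32)
generated = rng.random((5, 784)).astype(np.float32)

# Euclidean distance from every generated sample to every training image.
dists = np.linalg.norm(generated[:, None, :] - train_images[None, :, :], axis=-1)
nearest_idx = dists.argmin(axis=1)
nearest_dist = dists.min(axis=1)

# Near-zero distances suggest the model is simply reproducing training instances.
for i, (j, d) in enumerate(zip(nearest_idx, nearest_dist)):
    print(f"generated sample {i}: nearest training image {j}, distance {d:.3f}")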
It is even possible to simultaneously underfit and overfit yet still produce samples that individually look
good. Imagine a generative model trained on images of dogs and cats that simply
learns to reproduce the training images of dogs. Such a model has clearly overfit,
because it does not produce images that were not in the training set, but it has