acceptable. Any change to the input data changes the distribution to be captured and
fundamentally alters the task. For example, multiplying the input by 0.1 will artificially
increase the likelihood by a factor of 10.
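A minimal numerical sketch (using a Gaussian density as a stand-in model; the variable names and dimensionality are purely illustrative) makes the change-of-variables effect explicit: squeezing each input dimension by a factor of 10 raises the density, and hence the measured likelihood, by a factor of 10 per dimension.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n_dim = 5
x = rng.standard_normal((10000, n_dim))          # "data" drawn from N(0, I)

# Model matched to the original data: N(0, I)
ll_original = multivariate_normal(np.zeros(n_dim), np.eye(n_dim)).logpdf(x).mean()

# Rescale the inputs by 0.1; the matching model is now N(0, 0.01 * I)
ll_rescaled = multivariate_normal(np.zeros(n_dim),
                                  0.01 * np.eye(n_dim)).logpdf(0.1 * x).mean()

# The rescaled average log-likelihood is higher by exactly n_dim * log(10) nats
print(ll_rescaled - ll_original, n_dim * np.log(10))
```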
Issues with preprocessing commonly arise when benchmarking generative models on
the MNIST dataset, one of the more popular generative modeling benchmarks. MNIST
consists of grayscale images. Some models treat MNIST images as points in a real
vector space, while others treat them as binary. Yet others treat the grayscale values
as probabilities for binary samples. It is essential to compare real-valued models
only to other real-valued models and binary-valued models only to other binary-valued
models. Otherwise the measured likelihoods are not defined on the same space. (For
binary-valued models, the log-likelihood can be at most 0, while for real-valued models
it can be arbitrarily high, since it is the measurement of a density.) Among binary models,
it is important to compare models using exactly the same kind of binarization. For
example, we might binarize a gray pixel to 0 or 1 by thresholding at 0.5, or by drawing
a random sample whose probability of being 1 is given by the gray pixel intensity.
If we use the random binarization, we might binarize the whole dataset once, or we
might draw a different random binarization for each step of training and then draw multiple
samples for evaluation. Each of these three schemes yields wildly different likelihood
numbers, and when comparing different models it is important that both models use
the same binarization scheme for training and for evaluation. In fact, researchers who
apply a single random binarization step share a file containing the results of the random
binarization, so that there is no difference in results based on different outcomes of the
binarization step.
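As an illustration, the three binarization schemes might be implemented as follows (a minimal sketch assuming pixel intensities scaled to [0, 1]; the function names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_threshold(x):
    """Deterministic binarization: a pixel is 1 if its gray value exceeds 0.5."""
    return (x > 0.5).astype(np.float32)

def binarize_fixed(x, seed=0):
    """Single random binarization, fixed once for the whole dataset
    (comparable across papers only if the same seed or file is shared)."""
    fixed_rng = np.random.default_rng(seed)
    return (fixed_rng.random(x.shape) < x).astype(np.float32)

def binarize_resampled(x, rng):
    """Fresh random binarization, drawn anew at every training step."""
    return (rng.random(x.shape) < x).astype(np.float32)

# x: a batch of grayscale images with intensities in [0, 1],
# e.g. of shape (batch_size, 784) for flattened MNIST digits.
x = rng.random((4, 784))
b1 = binarize_threshold(x)
b2 = binarize_fixed(x)
b3 = binarize_resampled(x, rng)
```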
Finally, in some cases the likelihood seems not to measure any attribute of the
model that we really care about. For example, real-valued models of MNIST can obtain
arbitrarily high likelihood by assigning arbitrarily low variance to background pixels
that never change. Models and algorithms that detect these constant features can reap
unlimited rewards, even though this is not a very useful thing to do. This strongly
suggests a need for developing other ways of evaluating generative models.
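A small numerical sketch makes the problem apparent: for a per-pixel Gaussian model whose mean matches a constant background pixel, the log density contains a term of the form -1/2 log(2πσ²), which grows without bound as the variance shrinks (the code below is an illustration only, not a recipe from the literature):

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log density of an independent univariate Gaussian per pixel."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((x - mu) / sigma) ** 2)

# A "background" pixel that is exactly 0 in every training image.
x_pixel = 0.0
for sigma in [1e-1, 1e-3, 1e-6]:
    # The mean matches the constant value, so the quadratic term vanishes and
    # the log density grows without bound as sigma shrinks.
    print(sigma, gaussian_logpdf(x_pixel, mu=0.0, sigma=sigma))
```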
Although this is still an open question, one possibility is to convert the
problem into a classification task. For example, we have seen that the NCE method
(Noise Contrastive Estimation, Section 19.6) compares the density of the training data
according to a learned unnormalized model with its density under a background model.
However, generative models do not always provide us with an energy function (equivalently,
an unnormalized density); examples include deep Boltzmann machines, generative stochastic
networks, most denoising autoencoders (which are not guaranteed to correspond to an
energy function), and deep belief networks. Therefore, it would be interesting to consider
a classification task in which one tries to distinguish the training examples from the
generated examples. This is precisely what is achieved by the discriminator network of
generative adversarial networks (Section 21.8.4). However, this would require an expensive
operation (training a discriminator) each time one wanted to evaluate performance.
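As a rough sketch of this idea (not the GAN discriminator itself, which is a neural network trained jointly with the generator), one could train any off-the-shelf classifier to separate held-out real examples from model samples; the function name below is hypothetical, and a simple logistic regression stands in for the discriminator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def discriminability_score(real_samples, generated_samples):
    """Train a classifier to separate real from generated samples.
    Accuracy near 0.5 suggests the two sets are hard to tell apart;
    accuracy near 1.0 suggests the model's samples are easily distinguished."""
    x = np.vstack([real_samples, generated_samples])
    y = np.concatenate([np.ones(len(real_samples)),
                        np.zeros(len(generated_samples))])
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# Toy check: samples from two slightly different Gaussians.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 10))
fake = rng.normal(0.2, 1.0, size=(1000, 10))
print(discriminability_score(real, fake))
```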