
How could unsupervised pre-training act as a regularizer? Simply by imposing
an extra constraint: the learned representations should not only be consistent
with predicting the outputs y well, but should also be consistent with capturing
the variations in the input x, i.e., with modeling P(x). This implicitly relies on
a prior assumption: that P(y|x) and P(x) share structure, so that learning about
P(x) can help us generalize better on P(y|x). Obviously this need not be the case
in general, e.g., if y is an effect of x. However, if y is a cause of
x, then we would expect this a priori assumption to be correct, as discussed at
greater length in Section 17.4 in the context of semi-supervised learning.
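As a concrete illustration of this two-phase procedure, the following sketch (not
from this book; it assumes PyTorch, synthetic data, and arbitrary layer sizes) first
fits an autoencoder to unlabeled inputs and then reuses its encoder to initialize a
classifier that is fine-tuned on a small labeled set. The pre-trained encoder is what
imposes the extra constraint of also being consistent with P(x).

import torch
import torch.nn as nn

# Illustrative sizes and synthetic stand-in data (assumptions, not from the book).
n_in, n_hidden, n_classes = 100, 64, 10
x_unlabeled = torch.randn(1000, n_in)             # many unlabeled inputs
x_labeled = torch.randn(100, n_in)                # few labeled inputs
y_labeled = torch.randint(0, n_classes, (100,))

encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
decoder = nn.Linear(n_hidden, n_in)

# Phase 1: unsupervised pre-training, capturing variation in x (modeling P(x)).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(50):
    loss = nn.functional.mse_loss(decoder(encoder(x_unlabeled)), x_unlabeled)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: supervised fine-tuning; the pre-trained encoder constrains the learned
# representation to be consistent with P(x) as well as with predicting y.
classifier = nn.Sequential(encoder, nn.Linear(n_hidden, n_classes))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(50):
    loss = nn.functional.cross_entropy(classifier(x_labeled), y_labeled)
    opt.zero_grad(); loss.backward(); opt.step()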
A disadvantage of unsupervised pre-training is that it is difficult to choose
the capacity hyperparameters (such as when to stop training) for the pre-training
phases. An expensive option is to try many different values of these hyperpa-
rameters and choose the one that gives the best supervised learning error after
fine-tuning. Another potential disadvantage is that unsupervised pre-training
may require larger representations than would be strictly necessary for the
task at hand, since, presumably, y is only one of the factors that explain x.
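A minimal version of this expensive search, under the same illustrative assumptions
as the previous sketch (PyTorch, synthetic data, made-up sizes), might wrap
pre-training and fine-tuning in a helper and grid over the number of pre-training
epochs, keeping the value with the lowest validation error after fine-tuning. The
helper name and candidate grid below are invented for the example.

import torch
import torch.nn as nn

n_in, n_hidden, n_classes = 100, 64, 10
x_u = torch.randn(1000, n_in)                                   # unlabeled inputs
x_tr, y_tr = torch.randn(80, n_in), torch.randint(0, n_classes, (80,))
x_va, y_va = torch.randn(20, n_in), torch.randint(0, n_classes, (20,))

def pretrain_then_finetune(pretrain_epochs):
    enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
    dec = nn.Linear(n_hidden, n_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(pretrain_epochs):                            # unsupervised phase
        loss = nn.functional.mse_loss(dec(enc(x_u)), x_u)
        opt.zero_grad(); loss.backward(); opt.step()
    clf = nn.Sequential(enc, nn.Linear(n_hidden, n_classes))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(50):                                         # supervised fine-tuning
        loss = nn.functional.cross_entropy(clf(x_tr), y_tr)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                       # validation error after fine-tuning
        return (clf(x_va).argmax(dim=1) != y_va).float().mean().item()

# Try several stopping points and keep the one with the lowest validation error.
errors = {e: pretrain_then_finetune(e) for e in [0, 10, 50, 200]}
best_pretrain_epochs = min(errors, key=errors.get)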
Today, as many deep learning researchers and practitioners have moved to
working with very large labeled datasets, unsupervised pre-training has become
less popular in favor of other forms of regularization such as dropout, to be
discussed in Section 7.11. Nevertheless, unsupervised pre-training remains an
important tool in the deep learning toolbox and should particularly be considered
when the number of labeled examples is low, such as in the semi-supervised,
domain adaptation and transfer learning settings, discussed next.
17.2 Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has
been learned in one setting (i.e., distribution P_1) is exploited to improve
generalization in another setting (say distribution P_2).
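To make this setting concrete, here is a minimal sketch (again an illustration in
PyTorch with synthetic data and invented sizes, not a recipe from the book): a
representation is trained on a large labeled sample standing in for P_1, and its
lower layer is then reused, with a new output layer, on a small sample standing in
for P_2, as discussed in the paragraph below.

import torch
import torch.nn as nn

n_in, n_hidden = 100, 64
n_classes_1, n_classes_2 = 20, 5
x1, y1 = torch.randn(5000, n_in), torch.randint(0, n_classes_1, (5000,))  # many P_1 examples
x2, y2 = torch.randn(50, n_in), torch.randint(0, n_classes_2, (50,))      # few P_2 examples

features = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())  # shared representation

# Setting 1: learn the representation where labeled data is plentiful.
head1 = nn.Linear(n_hidden, n_classes_1)
opt = torch.optim.Adam(list(features.parameters()) + list(head1.parameters()), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(head1(features(x1)), y1)
    opt.zero_grad(); loss.backward(); opt.step()

# Setting 2: reuse (here: freeze) the representation and train only a new output layer.
for p in features.parameters():
    p.requires_grad_(False)
head2 = nn.Linear(n_hidden, n_classes_2)
opt = torch.optim.Adam(head2.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(head2(features(x2)), y2)
    opt.zero_grad(); loss.backward(); opt.step()

Freezing the shared layer is only one option; the representation could instead be
fine-tuned on the second task, typically with a smaller learning rate.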
In the case of transfer learning, we consider that the task is different but many
of the factors that explain the variations in P_1 are relevant to the variations that
need to be captured for learning P_2. This is typically understood in a supervised
learning context, where the input is the same but the target may be of a different
nature, e.g., learn about visual categories that are different in the first and the
second setting. If there is a lot more data in the first setting (sampled from P_1),
then that may help to learn representations that are useful to quickly generalize
when examples of P_2
are drawn. For example, many visual categories share low-
level notions of edges and visual shapes, the effects of geometric changes, changes
in lighting, etc. In general, transfer learning, multi-task learning (Section 7.12),
and domain adaptation can be achieved via representation learning when there