y is an effect of x. However, if y is a cause of x, then we would expect this a priori
assumption to be correct, as discussed at greater length in Section 17.4 in the context
of semi-supervised learning.
A disadvantage of unsupervised pre-training is that it is difficult to choose the ca-
pacity hyper-parameters (such as when to stop training) for the pre-training phases. An
expensive option is to try many different values of these hyper-parameters and choose
the one which gives the best supervised learning error after fine-tuning. Another poten-
tial disadvantage is that unsupervised pre-training may require larger representations
than what would be necessarily strictly for the task at hand, since presumably, y is only
one of the factors that explain x.
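To make the expensive option above concrete, the following minimal sketch (in PyTorch, with synthetic data, invented layer sizes, and the number of pre-training epochs standing in for the capacity hyper-parameter; all names are hypothetical) pre-trains an autoencoder for each candidate value, fine-tunes the resulting encoder on the labeled set, and keeps the value giving the lowest supervised validation error.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-ins (assumptions of this sketch): many unlabeled examples
# of x, plus a small labeled set split into training and validation parts.
x_unlabeled = torch.randn(2000, 50)
x_train, y_train = torch.randn(200, 50), torch.randint(0, 3, (200,))
x_valid, y_valid = torch.randn(100, 50), torch.randint(0, 3, (100,))

def pretrain_encoder(n_epochs):
    # Unsupervised pre-training phase: fit an autoencoder on unlabeled x.
    # n_epochs plays the role of the capacity hyper-parameter to be chosen
    # (when to stop pre-training).
    encoder = nn.Sequential(nn.Linear(50, 20), nn.ReLU())
    decoder = nn.Linear(20, 50)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(x_unlabeled)), x_unlabeled)
        loss.backward()
        opt.step()
    return encoder

def finetune_and_validate(encoder):
    # Supervised fine-tuning of the pre-trained encoder plus a fresh output
    # layer; returns the validation error used for model selection.
    model = nn.Sequential(encoder, nn.Linear(20, 3))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return (model(x_valid).argmax(dim=1) != y_valid).float().mean().item()

# The expensive option: try several candidate values of the pre-training
# hyper-parameter and keep the one with the best supervised validation
# error after fine-tuning.
candidates = [0, 10, 50]
errors = {n: finetune_and_validate(pretrain_encoder(n)) for n in candidates}
best = min(errors, key=errors.get)
print(errors, "-> chosen number of pre-training epochs:", best)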
Today, as many deep learning researchers and practitioners have moved to working
with very large labeled datasets, unsupervised pre-training has become less popular in
favor of other forms of regularization such as dropout, to be discussed in Section 7.11.
Nevertheless, unsupervised pre-training remains an important tool in the deep learning
toolbox and should particularly be considered when the number of labeled examples is
low, such as in the semi-supervised, domain adaptation and transfer learning settings,
discussed next.
17.2 Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been
learned in one setting (i.e., distribution P_1) is exploited to improve generalization in
another setting (say distribution P_2).
In the case of transfer learning, we consider that the task is different but many of the
factors that explain the variations in P_1 are relevant to the variations that need to be
captured for learning P_2. This is typically understood in a supervised learning context,
where the input is the same but the target may be of a different nature, e.g., learn
about visual categories that are different in the first and the second setting. If there
is a lot more data in the first setting (sampled from P_1), then that may help to learn
representations that are useful to quickly generalize when examples of P_2 are drawn. For
example, many visual categories share low-level notions of edges and visual shapes, the
effects of geometric changes, changes in lighting, etc. In general, transfer learning, multi-
task learning (Section 7.12), and domain adaptation can be achieved via representation
learning when there exist features that would be useful for the different settings or tasks,
i.e., there are shared underlying factors. This is illustrated in Figure 7.6, with shared
lower layers and task-dependent upper layers.
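As a concrete sketch of this shared-lower-layers arrangement, the following PyTorch snippet (with made-up layer sizes and two hypothetical visual-category tasks, not taken from the figure) reuses a single trunk of lower layers and attaches one task-specific output head per task.

import torch
import torch.nn as nn

# Lower layers shared across tasks: they capture generic factors of
# variation (edges, shapes, effects of lighting, ...).
shared_trunk = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)

# Task-dependent upper layers: one output head per task.
head_task1 = nn.Linear(128, 10)  # e.g., 10 visual categories in the first setting (P_1)
head_task2 = nn.Linear(128, 5)   # e.g., 5 different categories in the second setting (P_2)

x = torch.randn(32, 784)               # a batch of inputs (same input space for both tasks)
h = shared_trunk(x)                    # shared representation
y1, y2 = head_task1(h), head_task2(h)  # task-specific predictions

In this arrangement, training on plentiful P_1 data updates both the trunk and the first head; the trunk can then be reused, and optionally fine-tuned, when learning the second head from fewer P_2 examples.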
However, sometimes what is shared among the different tasks is not the semantics
of the input but the semantics of the output, or the input may need to be treated
differently (e.g., consider user adaptation or speaker adaptation). In that case, it makes
more sense to share the upper layers (near the output) of the neural network, and have
a task-specific pre-processing, as illustrated in Figure 17.5.
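To make the contrast with the previous arrangement concrete, here is a minimal sketch (again in PyTorch, with invented layer sizes and a hypothetical speaker-adaptation setting) in which each speaker has its own small input encoder while the upper layers near the output are shared.

import torch
import torch.nn as nn

# Task-specific pre-processing: one small encoder per speaker, mapping that
# speaker's features into a common representation space.
encoder_speaker_a = nn.Sequential(nn.Linear(40, 64), nn.ReLU())
encoder_speaker_b = nn.Sequential(nn.Linear(40, 64), nn.ReLU())

# Shared upper layers, near the output: the output semantics (e.g., a common
# set of phoneme classes) are the same for every speaker.
shared_top = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 40),  # e.g., 40 output classes shared by all speakers
)

x_a = torch.randn(32, 40)                      # a batch of frames from speaker A
logits_a = shared_top(encoder_speaker_a(x_a))  # speaker-specific encoding, shared output layers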
In the related case of domain adaptation, we consider that the task (and the optimal
input-to-output mapping) is the same but the input distribution is slightly different.