
CHAPTER 16. REPRESENTATION LEARNING
learning hierarchical models, https://sites.google.com/site/nips2011workshop/transfer-learning-challenge).
In the first of these competitions, the experimental setup is the following.
Each participant is first given a dataset from the first setting (from distribution
P1), basically illustrating examples of some set of categories. The participants
must use this to learn a good feature space (mapping the raw input to some
representation), such that when we apply this learned transformation to inputs
from the transfer setting (distribution P2), a linear classifier can be trained to
generalize well from very few labeled examples. Figure 16.6 illustrates one of the
most striking results: as we consider deeper and deeper representations (learned
in a purely unsupervised way from data of the first setting, P1), the learning curve
on the new categories of the second (transfer) setting, P2, becomes much better, i.e.,
fewer labeled examples of the transfer tasks are needed to reach the apparently
asymptotic generalization performance.
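The two-stage setup described above can be sketched in a few lines of numpy. This is only an illustrative toy, not the competition entries themselves: PCA stands in for the deep unsupervised feature learner, a nearest-centroid rule stands in for the linear classifier, and all data, dimensions, and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting P1: abundant unlabeled data. PCA stands in here for the unsupervised
# feature learner (the actual experiments used deep unsupervised models).
X1 = rng.normal(size=(1000, 20))
X1[:, :5] *= 5.0                     # a few high-variance "factor" directions
mu = X1.mean(axis=0)
_, _, Vt = np.linalg.svd(X1 - mu, full_matrices=False)
W = Vt[:5].T                         # learned projection onto the top 5 components

def features(X):
    """Apply the representation learned on P1 to any new input."""
    return (X - mu) @ W

# Setting P2: new categories, very few labeled examples. Only a simple
# (nearest-centroid, hence linear) classifier is fit in the learned space.
X2 = np.vstack([rng.normal(loc=+2.0, size=(3, 20)),
                rng.normal(loc=-2.0, size=(3, 20))])
y2 = np.array([0, 0, 0, 1, 1, 1])
Z = features(X2)
centroids = np.stack([Z[y2 == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    d = np.linalg.norm(features(X)[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)
```

The point of the sketch is the division of labor: the representation is learned once, without labels, on P1, while the P2 categories only require estimating a centroid per class from three examples each.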
An extreme form of transfer learning is one-shot learning or even zero-shot
learning (also called zero-data learning), where only one, or even zero, examples
of the new task are given.
One-shot learning (Fei-Fei et al., 2006) is possible because, in the learned
representation, the new task corresponds to a very simple region, such as a ball-like
region or the region around a corner of the space (in a high-dimensional
space, there are exponentially many corners). This works to the extent that the
factors of variation corresponding to the relevant invariances have been cleanly
separated from the other factors in the learned representation space, and we have
somehow learned which factors do and do not matter when discriminating objects
of certain categories.
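The geometric picture above can be made concrete with a toy nearest-prototype classifier. Everything here is an illustrative assumption: we pretend a feature extractor has already isolated the discriminative factors in the first two coordinates, so that a single stored example per new class defines a ball-like region around it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learned representation: assume the feature extractor has already
# separated the discriminative factors (first two coordinates) from nuisance
# factors (the rest). Names and dimensions are invented for the example.
def represent(x):
    return np.asarray(x)[..., :2]    # keep only the factors that matter

# One labeled example ("one shot") per new category.
shots = {"cat": rng.normal(loc=[5.0, 0.0, 0.0, 0.0], scale=1.0),
         "dog": rng.normal(loc=[0.0, 5.0, 0.0, 0.0], scale=1.0)}
prototypes = {name: represent(x) for name, x in shots.items()}

def classify(x):
    """Nearest-prototype rule: each new class occupies a simple, ball-like
    region of the learned space, so one example per class suffices."""
    z = represent(x)
    return min(prototypes, key=lambda name: np.linalg.norm(z - prototypes[name]))
```

If the representation had not separated the relevant factors, the nuisance coordinates would dominate the distances and a single example per class would no longer be enough.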
Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Socher
et al., 2013) are only possible because additional information has been exploited
during training that provides representations of the “task” or “context”, helping
the learner figure out what is expected, even though no example of the new task
has ever been seen.
For example, in a multi-task learning setting, if each task
is associated with a set of features, i.e., a distributed representation (that is
always provided as an extra input, in addition to the ordinary input associated
with the task), then one can generalize to new tasks based on the similarity
between the new task and the old tasks, as illustrated in Figure 16.7. One learns
a function from inputs to outputs that is parametrized by the task representation.
In the case of zero-shot learning (Socher et al., 2013), the
“task” is a representation of a semantic object (such as a word), and its
representation has already been learned from data relating different semantic objects
together (such as natural language data, relating words to each other). On the other
hand, for some of the tasks (e.g., some words) one has data associating the variables
of interest (e.g., words and pixels in images). Thus one can generalize and associate