Chapter 1
Deep Learning for AI
Inventors have long dreamed of creating machines that think. Ancient Greek myths tell
of intelligent objects, such as animated statues of human beings and tables that arrive
full of food and drink when called. When programmable computers were first conceived,
people wondered whether they might become intelligent, over a hundred years before
one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with
many practical applications and active research topics. We look to intelligent software
to automate routine labor, understand speech or images, make diagnoses in medicine,
and support basic scientific research. This book is about deep learning, an approach
to AI based on enabling computers to learn from experience and understand the world
in terms of a hierarchy of concepts, with each concept defined in terms of its relation to
simpler concepts.
Many of the early successes of AI took place in relatively sterile and formal en-
vironments and did not require computers to have much knowledge about the world.
For example, IBM’s Deep Blue chess-playing system defeated world champion Garry
Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only
sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed
ways. Devising a successful chess strategy is a tremendous intellectual accomplishment,
but does not require much knowledge about the agent’s environment. The environment
can be described by a very brief list of rules, easily provided ahead of time by the
programmer.
Ironically, abstract and formal tasks such as chess that are among the most difficult
mental undertakings for a human being are among the easiest for a computer. A person’s
everyday life requires an immense amount of knowledge about the world, and much
of this knowledge is subjective and intuitive, and therefore difficult to articulate in a
formal way. Yet, computers require some form of knowledge in order to make intelligent
decisions. Where is that knowledge going to come from?
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. None of these projects has led
to a major success. One of the most famous such projects is Cyc. Cyc is an inference
engine and a database of statements in a language called CycL. These statements are
entered by a staff of human supervisors. It is an unwieldy process. People struggle
to devise formal rules with enough complexity to accurately describe the world. For
example, Cyc failed to understand a story about a person named Fred shaving in the
morning (Linde, 1992). Its inference engine detected an inconsistency in the story—it
knew that people do not have electrical parts, but because Fred was holding an electric
razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore
asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning. The introduction of machine
learning allowed computers to tackle problems involving knowledge of the real world
and make decisions that appear subjective. A simple machine learning algorithm called
logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef
et al., 1990). A simple machine learning algorithm called naive Bayes can separate
legitimate e-mail from spam e-mail. What we call a learning machine, or more generally
a learner, is the agent that executes the learning procedure: it takes training data as
input and yields a change in the agent (or, mathematically, a function).
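As an illustration, the sketch below fits a logistic regression classifier over hand-designed binary features. The feature names and data are invented toy values (not from the Mor-Yosef et al. study), and scikit-learn is assumed as the library:

```python
# A minimal sketch (invented toy data): logistic regression over
# hand-designed binary features, as in the cesarean-delivery example.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one patient; columns are doctor-reported features,
# e.g. [uterine_scar, breech_presentation, prior_cesarean] (hypothetical).
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0],
              [1, 1, 1]])
y = np.array([1, 1, 0, 1])  # 1 = cesarean recommended (toy labels)

model = LogisticRegression().fit(X, y)
print(model.predict([[1, 0, 0]]))        # predicted recommendation
print(model.predict_proba([[1, 0, 0]]))  # class probabilities
```

Note that the model sees only the features the doctor chose to report; change the features and the learner's view of the patient changes with them.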
The performance of these simple machine learning algorithms depends heavily on
the representation of the data they are given. For example, when logistic regression
is used to recommend cesarean delivery, the AI system does not examine the patient
directly. Instead, the doctor tells the system several pieces of relevant information, such
as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each
of these features of the patient correlates with various outcomes. However, it cannot
learn what features are useful, nor can it observe the features itself. If logistic regression
were given a 3-D MRI image of the patient, rather than the doctor's formalized report,
it would not be able to make useful predictions. Individual voxels[1] in an MRI scan have
negligible correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears through-
out computer science and even daily life. In computer science, operations such as search-
ing a collection of data can proceed exponentially faster if the collection is structured
and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but
find arithmetic on Roman numerals much more time consuming. It is not surprising
that the choice of representation has an enormous effect on the performance of machine
learning algorithms.
[1] A voxel is the value at a single point in a 3-D scan, much as a pixel is the value at a single point in an image.
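For instance, a minimal sketch in Python of the same membership query over unstructured and structured (sorted) data:

```python
# A minimal sketch: the same query is O(n) on raw data but
# O(log n) once the data is structured (sorted) for the task.
import bisect

data = [17, 3, 42, 8, 25, 1]

def linear_search(xs, target):
    # Unstructured representation: inspect every element.
    return any(x == target for x in xs)

sorted_data = sorted(data)  # invest once in a better representation

def binary_search(xs, target):
    # Structured representation: halve the search space each step.
    i = bisect.bisect_left(xs, target)
    return i < len(xs) and xs[i] == target

assert linear_search(data, 25) and binary_search(sorted_data, 25)
```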
Example of representation
Data can be represented in different ways, but some representations make it
easier for machine learning algorithms to capture the knowledge they provide.
For example, a number can be represented by its binary encoding (with n bits),
by a single real-valued scalar, or by its one-hot encoding (with 2^n bits of which
only one is turned on). In many cases, the compact binary representation is
a poor choice for learning algorithms, because two very nearby values (like 3,
encoded as binary 00000011, and 4, encoded as binary 00000100) have no digits
in common while two values that are very different (like binary 10000001 = 129
and binary 00000001 = 1) only differ by one digit. This makes it difficult for
the learning machine to generalize from examples to numerically close ones.
However, in many applications we expect that what is true for input x is often
true for input x + ε for a small ε. This is called the smoothness prior and is ex-
ploited in most applications of machine learning that involve real numbers, and
to some extent other data types in which some meaningful notion of similarity
can be defined.
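A minimal sketch in plain Python makes the bit-overlap contrast concrete (the 8-bit width and 256-way one-hot size are arbitrary choices for the example):

```python
# A minimal sketch: nearby integers can share no bits in binary,
# while one-hot codes make every pair of distinct values equidistant.
def binary_bits(n, width=8):
    return [int(b) for b in format(n, f"0{width}b")]

def one_hot(n, size=256):
    v = [0] * size
    v[n] = 1
    return v

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 3 = 00000011 vs 4 = 00000100: numerically adjacent, 3 differing bits.
print(hamming(binary_bits(3), binary_bits(4)))    # 3
# 129 = 10000001 vs 1 = 00000001: numerically far, 1 differing bit.
print(hamming(binary_bits(129), binary_bits(1)))  # 1
# One-hot: any two distinct values differ in exactly 2 positions.
print(hamming(one_hot(3), one_hot(4)))            # 2
```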
Many artificial intelligence tasks can be solved by designing the right set of features
to extract for that task, then providing these features to a simple machine learning
algorithm. For example, a useful feature for speaker identification from sound is the
pitch. The pitch can be formally specified—it is the lowest frequency major peak of the
spectrogram. It is useful for speaker identification because it is determined by the size
of the vocal tract, and therefore gives a strong clue as to whether the speaker is a man,
woman, or child.
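One rough way to compute such a hand-designed feature is sketched below with NumPy; the peak threshold and the toy test signal are invented for illustration, and real pitch trackers are considerably more careful:

```python
# A minimal sketch: estimate pitch as the lowest strong peak of the
# magnitude spectrum of a short audio frame.
import numpy as np

def estimate_pitch(frame, sample_rate, threshold=0.5):
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Lowest frequency whose magnitude is a sizable fraction of the
    # global maximum: the "lowest frequency major peak".
    strong = np.where(spectrum >= threshold * spectrum.max())[0]
    return freqs[strong[0]]

# Toy usage: a 200 Hz tone with a weaker harmonic at 400 Hz.
sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 200 * t) + 0.4 * np.sin(2 * np.pi * 400 * t)
print(estimate_pitch(frame, sr))  # close to 200 (within one FFT bin)
```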
However, for many tasks, it is difficult to know what features should be extracted. For
example, suppose that we would like to write a program to detect cars in photographs.
We know that cars have wheels, so we might like to use the presence of a wheel as a
feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms
of pixel values. A wheel has a simple geometric shape but its image may be complicated
by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the
fender of the car or an object in the foreground obscuring part of the wheel, and so on.
One solution to this problem is to use machine learning to discover not only the map-
ping from representation to output but also the representation itself. This approach is
known as representation learning. Learned representations often result in much better
performance than can be obtained with hand-designed representations. They also al-
low AI systems to rapidly adapt to new tasks, with minimal human intervention. A
representation learning algorithm can discover a good set of features for a simple task
in minutes, or a complex task in hours to months. Manually designing features for a
complex task requires a great deal of human time and effort; it can take decades for an
entire community of researchers.
When designing features or algorithms for learning features, our goal is usually to
separate the factors of variation that explain the observed data. In this context, we use
the word “factors” simply to refer to separate sources of influence; the factors are usually
not combined by multiplication. Such factors are often not quantities that are directly
observed but they exist in the minds of humans as explanations or inferred causes of
the observed data. They can be thought of as concepts or abstractions that help us
make sense of the rich variability in the data. When analyzing a speech recording, the
factors of variation include the speaker’s age and sex, their accent, and the words that
they are speaking. When analyzing an image of a car, the factors of variation include
the position of the car, its color, and the angle and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is
that many of the factors of variation influence every single piece of data we are able to
observe. The individual pixels in an image of a red car might be very close to black at
night. The shape of the car’s silhouette depends on the viewing angle. Most applications
require us to disentangle the factors of variation and discard the ones that we do not
care about.
Of course, it can be very difficult to extract such high-level, abstract features from
raw data. Many of these factors of variation, such as a speaker’s accent, also require
sophisticated, nearly human-level understanding of the data. When it is nearly as diffi-
cult to obtain a representation as to solve the original problem, representation learning
does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.1
shows how a deep learning system can represent the concept of an image of a person by
combining simpler concepts, such as corners and contours, which are in turn defined in
terms of edges.
Another perspective on deep learning is that it allows the computer to learn a multi-
step computer program. Each layer of the representation can be thought of as the
state of the computer’s memory after executing another set of instructions in parallel.
Networks with greater depth can execute more instructions in sequence. Being able
to execute instructions sequentially offers great power because later instructions can
refer back to the results of earlier instructions. According to this view of deep learning,
not all of the information in a layer’s representation of the input necessarily encodes
factors of variation that explain the input. The representation is also used to store state
information that helps to execute a program that can make sense of the input.
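To make this view concrete, here is a minimal sketch (NumPy, with random untrained weights and invented layer sizes) of a forward pass in which each layer computes a new representation of, and from, the one before it:

```python
# A minimal sketch: each layer re-represents the previous layer's
# output, like sequential steps of a program executed in parallel.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 64, 10]  # pixels -> (loosely) edges -> parts -> classes
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W in weights:
        h = np.maximum(0.0, h @ W)  # one rectified linear layer
    return h

x = rng.random(784)      # the "visible layer": raw pixel values
print(forward(x).shape)  # (10,): the deepest representation
```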
“Depth” is not a mathematically rigorous term in this context; there is no formal
definition of deep learning. All approaches to deep learning share the idea of nested
representations of data, but different approaches view depth in different ways. For some
approaches, the depth of the system is the depth of the flowchart describing the com-
putations needed to produce the final representation. The depth corresponds roughly
to the number of times we update the representation. Other approaches consider depth
to be the depth of the graph describing how concepts are related to each other. In this
case, the depth of the flowchart of the computations needed to compute the represen-
tation of each concept may be much deeper than the graph of the concepts themselves.
[Figure 1.1 appears here; the image itself is not reproduced in this extraction. Its panels are convolutional network feature visualizations taken from Zeiler and Fergus (2014), stacked by layer: visible layer (input pixels), 1st hidden layer (edges), 2nd hidden layer (corners and contours), 3rd hidden layer (object parts), and output (object identity: CAR, PERSON, ANIMAL).]
Figure 1.1: Illustration of a deep learning model. It is difficult for a computer to un-
derstand the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity
is very complicated. Learning or evaluating this mapping seems insurmountable if tack-
led directly. Deep learning resolves this difficulty by breaking the desired complicated
mapping into a series of nested simple mappings, each described by a different layer of
the model. The input is presented at the visible layer. Then a series of hidden layers ex-
tracts increasingly abstract features from the image. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer
can easily identify edges, by comparing the brightness of neighboring pixels. Given the
first hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the third
hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image. Images provided
by Zeiler and Fergus (2014).
[Figure 1.2 appears here; the image itself is not reproduced in this extraction. It is a Venn diagram of nested regions: AI (example: knowledge bases) contains machine learning (example: logistic regression), which contains representation learning (example: autoencoders), which contains deep learning (example: MLPs).]
Figure 1.2: A Venn diagram showing how deep learning is a kind of representation
learning, which is in turn a kind of machine learning, which is used for many but not
all approaches to AI. Each section of the Venn diagram includes an example of an AI
technology.
This is because the system’s understanding of the simpler concepts can be refined given
information about the more complex concepts. For example, an AI system observing an
image of a face with one eye in shadow may initially only see one eye. After detecting
that a face is present, it can then infer that a second eye is probably present as well.
To summarize, deep learning, the subject of this book, is an approach to AI. Specif-
ically, it is a type of machine learning, a technique that allows computer systems to
improve with experience and data. According to the authors of this book, machine
learning is the only viable approach to building AI systems that can operate in compli-
cated, real-world environments. Deep learning is a particular kind of machine learning
that achieves great power and flexibility by learning to represent the world as a nested
hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.2
illustrates the relationship between these different AI disciplines. Fig. 1.3 gives a high-
level schematic of how each works.
Figure 1.3: Flow-charts showing how the different parts of an AI system relate to each
other within different AI disciplines. Shaded boxes indicate components that are able
to learn from data.
Deep learning is the subject of this book. It involves learning multiple levels of
representation, corresponding to different levels of abstraction. In the past five years,
research on deep learning has had a tremendous impact, both at an academic level and
in terms of industrial breakthroughs.
Representation learning algorithms can either be supervised, unsupervised, or a
combination of both (semi-supervised). These notions are explained in more detail in
Chapter 5, but we introduce them briefly here. Supervised learning requires examples
that include both an input and a target output, the latter being generally interpreted
as what we would have liked the learner to produce as output, given that input. Such
examples are called labeled examples because the target output often comes from a hu-
man providing that “right answer”. Manual labeling can be tedious and expensive, and
unlabeled data is far more plentiful than labeled data. Unsupervised learning
allows a learner to capture statistical dependencies present in unlabeled data, while
semi-supervised learning combines labeled examples and unlabeled examples. The fact
that several deep learning algorithms can take advantage of unlabeled examples can be
an important advantage, discussed at length in this book, in particular in Chapter 10 as
well as in Section 1.5 of this introductory chapter.
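As a concrete sketch of the distinction, scikit-learn's semi-supervised API marks unlabeled examples with the label -1; the toy data below is invented for illustration:

```python
# A minimal sketch: semi-supervised learning with mostly unlabeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 20))
true_y = (X[:, 0] > 0.5).astype(int)
labeled = rng.random(1000) < 0.05   # only ~5% of examples carry a label
y = np.where(labeled, true_y, -1)   # -1 marks "unlabeled"

clf = SelfTrainingClassifier(LogisticRegression())
clf.fit(X, y)             # uses labeled and unlabeled examples together
print(clf.predict(X[:5]))
```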
Deep learning has not only changed the field of machine learning and influenced our
understanding of human perception; it has also revolutionized areas of application such as
speech recognition and image understanding. Companies such as Google, Microsoft,
Facebook, IBM, NEC, Baidu and others have all deployed products and services based
on deep learning methods, and set up research groups to take advantage of deep learn-
ing. Deep learning was used to win international competitions in object recognition.[2]
Deep unsupervised learning (trying to capture the input distribution, and relying
less on having many labeled examples) performs exceptionally well in a transfer con-
text, when the model has to be applied to a test distribution that is a bit different
from the training distribution,[3] e.g., involving new categories. With large training sets,
deep supervised learning has been the most impressive, as recently shown with the out-
standing breakthrough achieved by Geoff Hinton’s team on the ImageNet object recogni-
tion 1000-class benchmark, bringing down the state-of-the-art error rate from 26.1% to
15.3% (Krizhevsky et al., 2012a). Since then, these competitions have consistently been won by
deep convolutional nets and, as of this writing, advances in deep learning have brought
this even further down to 6.5%, using even deeper networks (Szegedy et al., 2014). On
another front, whereas speech recognition error rates kept decreasing in the 90’s (thanks
mostly to better systems engineering, larger datasets, and larger HMM models), perfor-
mance of speech recognition systems had stagnated in the 2000-2010 decade, until the
[2] Jürgen Schmidhuber's lab at IDSIA has won many such competitions; see http://www.idsia.ch/~juergen/deeplearning.html. Hinton's U. Toronto group and Fergus' NYU group respectively won the ImageNet competitions in 2012 and 2013 (Krizhevsky et al., 2012a); see http://www.image-net.org/challenges/LSVRC/2012/ and http://www.image-net.org/challenges/LSVRC/2013/.
We have used deep learning to win an international competition in computer vision focused on the
detection of emotional expression from videos and audio (Ebrahimi et al., 2013).
[3] Teams from Yoshua Bengio's lab won the Transfer Learning Challenge (results at ICML 2011 work-
shop) (Mesnil et al., 2011) and the NIPS’2011 workshop Transfer Learning Challenge (Goodfellow et al.,
2011).
advent of deep learning (Hinton et al., 2012a). Since then, thanks to large and deep
architectures, the error rates on the well-known Switchboard benchmark have dropped by
about half! As a consequence, most of the major speech recognition systems (Microsoft,
Google, IBM, Apple) have incorporated deep learning, which has become a de facto
standard at conferences such as ICASSP.
1.1 Who should read this book?
This book can be useful for a variety of readers, but the main target audiences are
university students (undergraduate or graduate) learning about machine learning, and
engineers and practitioners of machine learning, artificial intelligence, data mining
and data science who aim to better understand and take advantage of deep learning. Ma-
chine learning is successfully applied in many areas, including computer vision, natural
language processing, robotics, speech and audio processing, bioinformatics, video-games,
search engines, online advertising and many more. Deep learning has been most suc-
cessful in traditional AI applications but is expanding into other areas, such as modeling
molecules, customers or web pages. In addition, deep learning is moving out of its early
territory of pattern recognition (in speech and images) and into natural language pro-
cessing and tasks with complex outputs, e.g. with ongoing breakthroughs in machine
translation (Hermann and Blunsom, 2014; Devlin et al., 2014; Sutskever et al., 2014;
Bahdanau et al., 2014).
Knowledge of basic concepts in machine learning will be very helpful for absorbing the
concepts in this book, although the book will attempt to explain these concepts intu-
itively (and sometimes formally) when needed. Similarly, knowledge of basic concepts
in probability, statistics, calculus, linear algebra, and optimization will be very useful,
although the book will briefly explain the required concepts as needed, in particular in
Chapters 2, 3 and 4. Knowledge of computer science and familiarity with programming
will be mostly useful for understanding and modifying the code provided in the practical ex-
ercises associated with this book, in the Python language and based on the Pylearn2
machine learning and deep learning library, which is dedicated to rapidly prototyping
new algorithms and sharing research results.
Since much science remains to be done in deep learning, many practical aspects
of these algorithms can be seen as tricks, practices that have been found to work (at
least in some contexts) while we do not have complete explanatory theories for them.
This book will also spell out these practical guidelines, although the reader is invited to
question them and even to figure out the reasons for their successes and failures. A live
online resource http://www.deeplearning.net/book/guidelines allows practitioners
and researchers to share their questions and experience and keep abreast of developments
in the art of deep learning. Keep in mind that science is not frozen but evolving, because
we dare to question established wisdom, and readers are invited to contribute to this
exciting expansion and clarification of our knowledge.
1.2 Machine Learning
This section introduces a few machine learning concepts, while a deeper treatment of
this branch of knowledge at the intersection of artificial intelligence and statistics can be
found in Chapter 5. A machine learning algorithm (or learner) sees training examples,
each of which can be thought of as an assignment of values to some variables (which we
call random variables[4]). A learner uses these examples to build a function or a model
from which it can answer questions about these random variables or perform some useful
computation on their values. By analogy, human brains (whose learning algorithm we
do not perfectly understand but would like to decipher) see as training examples the
sequence of experiences of their life, and the observed random variables are what arrives
at their senses as well as the internal reinforcement signals (such as pain or pleasure)
that evolution has programmed in us to guide our learning process towards survival and
reproduction. Human brains also observe their own actions, which influence the world
around them, and it appears that human brains try to learn the statistical dependencies
between these actions and their consequences, so as to maximize future rewards.
What we call a configuration of variables is an assignment of values to the vari-
ables. For example, if we have 10 binary variables then there are 2^10 possible config-
urations of their values. The crux of what a learner needs to do is to guess which
configurations of the variables of interest[5] are most likely to occur again. And the fun-
damental challenge involved is the so-called curse of dimensionality: the number of
possible configurations of these variables can be astronomical, since it grows exponentially
with the number of variables involved (which could be, for example, the thousands of
pixels in an image). Indeed, each observed example only tells the learner
about one such configuration. To make a prediction on configurations never seen before,
the learner can only make a guess. How could humans, animals or machines possibly
have a preference for some of these unseen configurations, after seeing some positive
examples, a very small fraction of all the possible configurations? If the only informa-
tion available to make this guess comes from the examples, then not much could be said
about new unseen configurations except that they should be less likely than the observed
ones. However, every machine learning algorithm incorporates not just the information
from examples but also some priors, which can be understood as knowledge about the
world that has been built up from previous experience, either personal experience of the
learner, or from the prior experience of other learners, e.g., through biological or cultural
evolution, in the case of humans. These priors can be combined with the given training
examples to form better predictions. Bayesian machine learning attempts to formalize
these priors as probability distributions; once this is done, Bayes' theorem and the
laws of probability (discussed in Chapter 3) dictate what the right predictions should
[4] Informally, a random variable is a variable which can take different values, with some uncertainty
about which value will be taken, i.e., its value is not perfectly predictable.
[5] At the lowest level, the variables of interest would be the externally observed variables (the sensor
readings both from the outside world and from the body), actions and rewards, but we will see in this
book that it can be advantageous to model the statistical structure by introducing internal variables or
internal representations of the observed variables.
be. However, the required calculations are generally intractable and it is not always clear
how to formalize all our priors in this way. Many of the priors associated with different
learning algorithms are implicit in the computations performed, which can sometimes
be viewed as approximations aimed at reconciling the information in the priors and the
information in the data, in a way that is computationally tractable. Indeed, machine
learning needs to deal not just with priors but also with computation. A practical learn-
ing algorithm must not just make good predictions but also do so quickly enough, with
reasonable computing resources spent during training and at the time of making and
acting on decisions.
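A quick back-of-the-envelope sketch of this explosion in plain Python:

```python
# A minimal sketch: the number of configurations grows exponentially,
# so observed examples can only cover a vanishing fraction of them.
from itertools import product

n = 10
configs = list(product([0, 1], repeat=n))
print(len(configs))             # 2**10 = 1024 configurations

n_pixels = 1000                 # a small binary image
print(len(str(2 ** n_pixels)))  # ~302 digits: astronomically many

print(1_000_000 / 2 ** 30)      # a million examples barely dent
                                # even 30 binary variables
```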
The best studied machine learning task is the problem of supervised classifica-
tion. The examples are configurations of input variables along with a target category.
The learner’s objective is typically to classify new input configurations, i.e., to predict
which of the categories is most likely to be associated with the given input values. A
central concept in machine learning is generalization: how good are these guesses
made on new configurations of the observed variables? How many classification errors
would a learner make on new examples? A learner that generalizes well on a given dis-
tribution makes good guesses about new configurations, after having seen some training
examples. This concept is related to another basic concept: capacity. Capacity is a
measure of the flexibility of the learner, essentially the number of training examples
that it could always learn perfectly. Machine learning theory has traditionally focused
on the relationship between generalization and capacity. Overfitting occurs when ca-
pacity is too large compared to the number of examples, so that the learner does a
good job on the training examples (it correctly guesses that they are likely configu-
rations) but a very poor one on new examples (it does not discriminate well between
the likely configurations and the unlikely ones). Underfitting occurs when, instead, the
learner does not have enough capacity, so that even on the training examples it is not
able to make good guesses: it does not manage to capture enough of the information
present in the training examples, maybe because it does not have enough degrees of
freedom to fit all the training examples. Whereas theoretical analysis of overfitting is
often of a statistical nature (how many examples do I need to get good generaliza-
tion with a learner of a given capacity?), underfitting has been less studied because it
often involves the computational aspect of machine learning. The main reason we get
underfitting (especially with deep learning) is not that we choose to have insufficient
capacity but that obtaining high capacity in a learner that has strong priors often
involves difficult numerical optimization. Numerical optimization methods attempt
to find a configuration of some variables (often called parameters, in machine learning)
that minimizes or maximizes some given function of these parameters, which we call an
objective function or training criterion. During training, one typically iteratively
modifies the parameters so as to gradually minimize the training criterion (for exam-
ple, the classification error). At each step of this adaptive process, the learner slightly
changes its parameters so as to make better guesses on the training examples. This of
course does not in general guarantee that future guesses on novel test examples will
be good, i.e., we could be in an overfitting situation. In the case of most deep learning
algorithms, this difficulty in optimizing the training criterion is related to the fact that
it is typically not convex in the parameters of the model. It is even the case that for
many models (such as most neural network models), obtaining the optimal parameters
can be computationally intractable (in general requiring computation that grows ex-
ponentially with the number of parameters). It means that approximate optimization
methods (which are often iterative) must be used, and such methods often get stuck in
what appear to be local minima of the training criterion,[6] whereas one would hope
to find the best possible solution, i.e., a global minimum. We believe that the issue
of underfitting is central in deep learning algorithms and deserves a lot more attention
from researchers.
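As a minimal sketch of this adaptive process, gradient descent on an invented one-parameter convex criterion (real deep-learning criteria are non-convex, which is exactly where the local-minima difficulty below comes from):

```python
# A minimal sketch of numerical optimization: repeatedly nudge the
# parameter in the direction that decreases the training criterion.
def criterion(w):
    return (w - 3.0) ** 2        # toy objective function

def gradient(w):
    return 2.0 * (w - 3.0)       # its derivative

w = 0.0                          # initial parameter value
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)  # one small parameter update

print(w)  # very close to the global minimum at w = 3
```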
Another central concept in modern machine learning is probability theory. Be-
cause the data gives us information about random variables, probability theory pro-
vides a natural language to describe their uncertainty and many learning algorithms
(including most described in this book) are formalized as means to capture a proba-
bility distribution over the observed variables. The probability assigned by a learner
to a configuration quantifies how likely it is to encounter that configuration of vari-
ables. The classical means of training a probabilistic model involve the definition of the
model as a family of probability functions indexed by some parameters, and the use of
the maximum likelihood criterion[7] (or a variant that incorporates some priors) to
define towards what objective to optimize these parameters. Unfortunately, for many
probabilistic models of interest (that have enough capacity and expressive power), it
is computationally intractable to maximize the likelihood exactly, and even computing
its gradient[8] is generally intractable. This has led to a variety of practical learning
algorithms with different ways to bypass these obstacles, many of which are described
in this book.
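For a model family simple enough to be tractable, maximizing the likelihood is straightforward; a minimal sketch for a Bernoulli model on invented toy data:

```python
# A minimal sketch: maximum likelihood for a Bernoulli model.
# The family is p(x = 1) = theta; the criterion is the log-probability
# the model assigns to the whole training set.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # toy binary observations

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(best, data.mean())  # both ~0.75: the MLE equals the empirical mean
```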
Another machine learning concept that turns out to be important to understand
many deep learning algorithms is that of manifold learning. The manifold learning
hypothesis (Cayton, 2005; Narayanan and Mitter, 2010) states that probability is con-
centrated around regions called manifolds, i.e., that most configurations are unlikely and
that probable configurations are neighbors of other probable configurations. We define
the dimension of a manifold as the number of independent types of changes (e.g. or-
thogonal directions) by which one can move and stay among probable configurations.
This hypothesis of probability concentration seems to hold for most AI tasks of interest,
as can be verified by the fact that most configurations of input variables are unlikely
(pick pixel values randomly and you will almost never obtain a natural-looking image).
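This is easy to check empirically; a minimal sketch (NumPy, using the correlation between horizontally adjacent pixels as an invented stand-in for natural-image structure):

```python
# A minimal sketch: uniformly random pixels lie far from the manifold
# of natural images -- they lack the local structure photographs have.
import numpy as np

rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=(64, 64)).astype(float)

# Natural images have strongly correlated neighboring pixels;
# random configurations have essentially none.
corr = np.corrcoef(noise[:, :-1].ravel(), noise[:, 1:].ravel())[0, 1]
print(round(corr, 3))  # ~0.0 for noise; typically 0.9+ for photographs
```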
The manifold hypothesis also states that small changes (e.g. translating an input image)
tend to leave unchanged categorical variables (e.g., object identity) and that there are
far fewer such local degrees of freedom (manifold dimensions) than the overall input
dimension (the number of observed variables). The associated natural clustering hy-
pothesis assumes that different classes correspond to different manifolds, well separated
by vast zones of low probability. These ideas turn out to be very important for under-
standing the basic concept of representation associated with deep learning algorithms,
which may be understood as a way of specifying a coordinate system along these manifolds,
as well as telling to which manifold an example belongs. Additionally, these manifold
learning ideas turn out to be important for understanding the mechanisms by
which regularized auto-encoders capture both the unknown manifold structure of the
data (Chapter 13) and the underlying data generating distribution (Section 17.9).
[6] A local minimum is a configuration of the parameters that cannot be improved by small changes
in the parameters, so if the optimization procedure is iterative and operates by small changes, training
appears stuck and unable to progress to a globally optimal solution.
[7] The maximum likelihood criterion is simply the probability that the model assigns to the whole training set.
[8] The gradient is the direction in which parameters should be changed in order to slightly improve the training criterion.
1.3 Historical Perspective and Neural Networks
Modern deep learning research takes a lot of its inspiration from neural network research
of previous decades. Other major intellectual sources of concepts found in deep learning
research include works on probabilistic modeling and graphical models, as well as works
on manifold learning.
The starting point of the story, though, is the Perceptron and the Adaline, inspired
by knowledge of the biological neuron: simple learning algorithms for artificial neu-
ral networks were introduced around 1960 (Rosenblatt, 1958; Widrow and Hoff, 1960;
Rosenblatt, 1962), leading to much research and excitement. However, after the initial
enthusiasm, research progress reached a plateau due to the inability of these simple
learning algorithms to learn representations (i.e., in the case of an artificial neural net-
work, to learn what the intermediate layers of artificial neurons - called hidden layers -
should represent). This limitation to learning linear functions of fixed features led to a
strong reaction (Minsky and Papert, 1969) and the dominance of symbolic computation
and expert systems as the main approaches to AI in the late 60’s, 70’s and early 80’s.
In the mid-1980’s, a revival of neural network research took place thanks to the back-
propagation algorithm for learning one or more layers of non-linear features (Rumelhart
et al., 1986a; LeCun, 1987). This second bloom of neural network research took place
in the decade up to the mid-90’s, at which point many overly strong claims were made
about neural nets, mostly with the objective of attracting investors and funding. In the
90’s and 2000’s, other approaches to machine learning dominated the field, especially
those based on kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf
et al., 1999) and those based on probabilistic approaches, also known as graphical mod-
els (Jordan, 1998). In practical applications, simple linear models with labor-intensive
design of hand-crafted features dominated the field. Kernel machines were found to per-
form as well or better while being more convenient to train (with fewer hyper-parameters,
i.e., knobs to tune by hand) and affording easier mathematical analysis coming from
convexity of the training criterion: by the end of the 1990’s, the machine learning com-
munity had largely abandoned artificial neural networks in favor of these more limited
methods despite the impressive performance of neural networks on some tasks (LeCun
et al., 1998a; Bengio et al., 2001a).
In the early years of this century, the groups at Toronto, Montreal, and NYU (and
shortly after, Stanford) worked together under a Canadian long-term research initiative
(the Canadian Institute for Advanced Research, CIFAR) to break through two of
the limitations of old-day neural networks: unsupervised learning and the difficulty of
training deep networks. Their work initiated a new wave of interest in artificial neural
networks by introducing a new way of learning multiple layers of non-linear features
without requiring vast amounts of labeled data. Deep architectures had been proposed
before (Fukushima, 1980; LeCun et al., 1989; Schmidhuber, 1992; Utgoff and Stracuzzi,
2002), but without major success in jointly training a deep neural network with many
layers, except to some extent in the case of convolutional architectures (LeCun et al.,
1989, 1998b), covered in Chapter 11. The breakthrough came from a semi-supervised
procedure: using unsupervised learning to learn one layer of features at a time and then
fine-tuning the whole system with labeled data (Hinton et al., 2006; Bengio et al., 2007;
Ranzato et al., 2007), described in Chapter 10. This initiated a lot of new research
and other ways of successfully training deep nets emerged. Even though unsupervised
pre-training is sometimes unnecessary for datasets with a very large number of labels,
it was the early success of unsupervised pre-training that led many new researchers
to investigate deep neural networks. In particular, Glorot et al. (2011a) showed for the
first time that deep supervised networks could be trained without unsupervised pre-training,
using rectifiers (Nair and Hinton, 2010b) instead of the previously used forms of non-
linearity in neural networks, as well as appropriate initialization (Glorot and Bengio,
2010) allowing information to flow well both forward (to produce predictions from input)
and backward (to propagate error signals). A large fraction of the subsequent successes
of deep networks have relied on piecewise non-linearities such as the rectifier (Krizhevsky
et al., 2012a) and maxout (Goodfellow et al., 2013a).
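A minimal sketch of the rectifier next to the saturating sigmoid it largely replaced; comparing their gradients hints at why error signals flow backward more easily:

```python
# A minimal sketch: for large |x| the sigmoid's gradient vanishes,
# while the rectifier's gradient stays exactly 1 wherever it is active.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    return np.maximum(0.0, x)    # relu(x) = max(0, x)

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(rectifier(x))                   # [0.  0.  0.5 5. ]
print(sigmoid(x) * (1 - sigmoid(x)))  # ~0.0066 at x = 5: nearly vanished
print(np.where(x > 0, 1.0, 0.0))      # rectifier gradient: 0 or exactly 1
```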
1.4 Recent Impact of Deep Learning Research
Since 2010, deep learning has had spectacular practical successes. It has led to much
better acoustic models that have dramatically improved the state of the art in speech
recognition. Deep neural nets are now used in deployed speech recognition systems
including voice search on Android (Dahl et al., 2010; Deng et al., 2010; Seide
et al., 2011; Hinton et al., 2012a). Deep convolutional nets have led to major advances
in state-of-the-art performance for recognizing large numbers of different types
of objects in images (now deployed in Google+ photo search). They have also had
spectacular successes for pedestrian detection and image segmentation (Sermanet et al.,
2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance
in traffic sign classification (Ciresan et al., 2012). An organization called Kaggle runs
machine learning competitions on the web. Deep learning has had numerous successes
in these competitions.[9][10]
The number of research groups doing deep learning has grown from just 3 in 2006
(Toronto, Montreal, NYU, all within NCAP) to 4 in 2007 (+Stanford), and to more
than 26 in 2013.[11] Accordingly, the number of papers published in this area has
[9] http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview
[10] http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html
[11] http://deeplearning.net/deep-learning-research-groups-and-labs
skyrocketed. Before 2006, it had become very difficult to publish any paper having
to do with artificial neural networks at NIPS or ICML (the leading machine learning
conferences). In the last few years, deep learning has been added as a keyword or area
for submitted papers and sessions at NIPS and ICML. At ICML 2013 there was almost
a whole day devoted to deep learning papers. The first deep learning workshop was co-
organized by Yoshua Bengio, Yann LeCun, Ruslan Salakhutdinov and Hugo Larochelle
at NIPS’2007, in an unofficial session sponsored by CIFAR because the NIPS workshop
organizers had rejected the workshop proposal. It turned out to be the most popular
workshop that year (with around 200 participants). That popularity has continued year
after year, with Yoshua Bengio co-organizing most of the deep learning workshops at
NIPS or ICML since then. There are now even multiple workshops on deep learning
subjects (such as specialized workshops on the application of deep learning to speech or
to natural language).
This has led Yann LeCun and Yoshua Bengio to create a new conference on the sub-
ject. They called it the International Conference on Learning Representations
(ICLR) because its scope encompasses not just deep learning but the more general
subject of representation learning (which includes topics such as sparse coding, which
learns shallow representations; shallow representation-learners can be used as
building blocks for deep representation-learners). The first ICLR was ICLR’2013, and
was a clear success, attracting more than 110 participants (almost twice as many as
the defunct conference which ICLR replaced, the Learning Conference). It was also an
opportunity to experiment with a novel reviewing and publishing model based on open
reviews and openly visible submissions (as arXiv papers), with the objective of achieving
a faster dissemination of information (not keeping papers hidden from view while being
evaluated) and a more open and interactive discussion between authors, reviewers and
non-anonymous spontaneous commenters.
The general media and media specialized in information technology have picked up
on deep learning as an exciting new technology since 2012, and we have even seen it
covered on television in 2013, on NBC.[12] It started with two articles in the New York
Times in 2012, both covering the work done at Stanford by Andrew Ng and his group.
In March 2013 it was announced (in particular in Wired[13]) that Google acqui-hired
Geoff Hinton, Ilya Sutskever, and Alex Krizhevsky to help them “supercharge” machine
learning applications at Google. In April 2013, the MIT Technology Review published
their annual list of 10 Breakthrough Technologies and they put deep learning first on
their list. This stirred a lot of discussion around the web, including an
interview of Yoshua Bengio for Wired,[14] and there have since been many pieces, not
just in technology oriented media but also, for example, in business oriented media like
Forbes.[15] As writing of this book started, after the major investments from Google,
Microsoft and Baidu, the news was that Facebook was creating a new research group
12
http://video.cnbc.com/gallery/?play=1&video=3000192292
13
http://www.wired.com/wiredenterprise/2013/03/google_hinton/
14
http://www.wired.com/wiredenterprise/2013/06/yoshua-bengio
15
http://www.forbes.com/sites/netapp/2013/08/19/what-is-deep-learning/
16
devoted to deep learning
16
, led by Yann LeCun.
1.5 Challenges for Future Research
In spite of all the successes of deep learning to date and the obvious fact that
deep learning already has a great industrial impact, there is still a huge gap
between the information processing and perception capabilities of even simple
animals and those of our current technology. Understanding more of the principles
behind such information processing architectures will require much more basic
science. Such fundamental research has been essential to bringing deep learning
to where it is today, and some of the challenges ahead require even more of it,
along with much deeper analysis and theory regarding even the existing
algorithms. Scaling up deep learning substantially will also require significant
systems engineering research. All of these future advances will probably lead to
big changes in how we build and deploy information technology. While the novel
techniques developed by deep learning researchers have achieved impressive
progress, training deep networks to learn meaningful representations is not yet a
solved problem. Moreover, while we believe that deep learning is a crucial
component, it is not a complete solution to AI. In this book, we review the
current state of knowledge on deep learning, and we also present our ideas for
how to move beyond current deep learning methods toward human-level AI. We
identify some major qualitative deficiencies in existing deep learning systems
and propose general ideas for how to address them. Success on any of these fronts
would represent a major conceptual leap forward on the path to AI.
In the examples of outstanding applications of deep learning described above, the
impressive breakthroughs have mostly been achieved with supervised learning
techniques for deep architectures, where the training examples are (input, target
output) pairs, with the target output often being a human-produced label. We
believe that some of the most important future progress in deep learning will
hinge on achieving a similar impact in the unsupervised case (no labels at all,
only unlabeled examples) and the semi-supervised case (a mix of labeled and
unlabeled examples).
However, even in the supervised case, there are signs that the techniques that
work in the regime of neural networks with a few thousand neurons per layer
encounter training difficulties when applied to networks with 10 times more units
per layer and 100 times more connections, making it difficult to exploit the
increased model size (Dauphin and Bengio, 2013a). Even though the scaling
behavior of stochastic gradient descent is theoretically very good in terms of
computations per update, these observations suggest a numerical optimization
challenge that must be addressed.
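To make the “computations per update” point concrete, the following is a minimal
sketch (our illustration, not code from the work cited above) of one minibatch
stochastic gradient descent update on a linear model; all sizes and the learning
rate are illustrative assumptions. Note that the cost of one update grows with
the model and minibatch size, but is independent of the total number of training
examples.

    # A minimal sketch of one minibatch SGD update on a linear model with
    # squared error. All sizes below are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    n_in, n_out = 1000, 10            # model dimensions
    batch_size = 128
    learning_rate = 0.1

    W = rng.normal(scale=0.01, size=(n_in, n_out))   # parameters
    b = np.zeros(n_out)

    def sgd_step(X, Y, W, b):
        """One update; cost is O(batch_size * n_in * n_out), regardless of
        how many training examples exist in total."""
        pred = X @ W + b              # forward pass on the minibatch only
        err = pred - Y
        W -= learning_rate * (X.T @ err) / len(X)
        b -= learning_rate * err.mean(axis=0)
        return W, b

    # A minibatch drawn from a (possibly enormous) dataset: the per-update
    # cost is the same whether the dataset has 10**4 or 10**9 examples, but
    # scaling n_in and n_out by 10 each multiplies it by roughly 100.
    X_batch = rng.normal(size=(batch_size, n_in))
    Y_batch = rng.normal(size=(batch_size, n_out))
    W, b = sgd_step(X_batch, Y_batch, W, b)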
In addition to these numerical optimization difficulties, scaling up large and
deep neural networks as they currently stand would require a substantial increase
in computing power, which remains a limiting factor of our research. Training
much larger models with current hardware (or the hardware likely to be available
in the next few years) will require a change in design and/or the ability to
effectively exploit parallel computation. These requirements raise non-obvious
questions where fundamental research is also needed.
Furthermore, some of the biggest challenges remain ahead of us regarding
unsupervised deep learning. Powerful unsupervised learning is important for many
reasons:

• Unsupervised learning allows a learner to take advantage of unlabeled data.
Most of the data available to machines (and to humans and animals) is unlabeled,
i.e., without a precise and symbolic characterization of its semantics and of
the outputs desired from a learner. Humans and animals are also driven by
motivation rather than explicit labels, which guides research into learning
algorithms based on a reinforcement signal, one that is much weaker than the
signal required for supervised learning.

• Whereas supervised learning always answers the same type of question (predict
y from x), unsupervised learning allows a learner to capture the most general
kind of information about the observed variables, so as to be able to answer new
questions about them in the future, questions that were not anticipated at the
time of seeing the training examples.

• Unsupervised learning has been shown to be a good regularizer for supervised
learning (Erhan et al., 2010), meaning that it can help the learner generalize
better, especially when the number of labeled examples is small. This advantage
clearly shows up in practical applications (e.g., the transfer learning
competitions won by NCAP members with unsupervised deep learning (Bengio, 2011;
Mesnil et al., 2011; Goodfellow et al., 2011)) where the distribution changes or
new classes or domains are considered (transfer learning, domain adaptation),
when some classes are frequent while many others are rare (fat tail or Zipf
distribution), or when new classes are shown with zero, one, or very few
examples (zero-shot and one-shot learning (Larochelle et al., 2008; Lake et al.,
2013; Socher and Ng, 2013)).

• There is evidence suggesting that unsupervised learning can be successfully
achieved mostly from a local training signal (as indicated by the successes of
the unsupervised layer-wise pre-training procedures (Bengio, 2009),
semi-supervised embedding (Weston et al., 2008), and intermediate-level hints
(Gulcehre and Bengio, 2013)), i.e., that it may suffer less from the difficulty
of propagating credit across a large network, which has been observed for
supervised learning. Local training algorithms are also much more likely to be
amenable to fast hardware-based implementations (see, for example, DARPA's
UPSIDE program as evidence of recent interest in hardware-enabled
implementations of deep learning). A minimal sketch of such a layer-wise
procedure follows this list.

• Solving the core problems of unsupervised learning would also help us solve
the core problems of structured-output tasks, where the output variable is very
high-dimensional, instead of just a few numbers or classes. For example, the
output could be a sentence or an image. In that case, the mathematical and
computational issues involved in unsupervised learning also arise, because there
is an exponentially large number of configurations of the output values that
need to be considered (to sum over them, when computing probability gradients,
or to find an optimal configuration, when making a decision).
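As promised above, here is a minimal sketch (our own illustration, not code from
the cited papers) of greedy layer-wise unsupervised pre-training with simple
tied-weight autoencoders: each layer is trained purely from a local
reconstruction signal, and its encoder output becomes the input of the next
layer. All sizes, learning rates, and step counts are illustrative assumptions.

    # Greedy layer-wise pre-training with tied-weight autoencoders (a sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    def train_autoencoder_layer(X, n_hidden, lr=0.01, n_steps=200):
        """Train a tied-weight autoencoder on X using only its own (local)
        reconstruction loss; return the encoder parameters and encoded data."""
        n_in = X.shape[1]
        W = rng.normal(scale=0.01, size=(n_in, n_hidden))
        b = np.zeros(n_hidden)          # encoder bias
        c = np.zeros(n_in)              # decoder bias
        for _ in range(n_steps):
            H = np.tanh(X @ W + b)      # encode
            R = H @ W.T + c             # decode with tied weights
            err = R - X                 # local reconstruction error
            dA = (err @ W) * (1.0 - H ** 2)          # backprop through tanh
            W -= lr * (X.T @ dA + err.T @ H) / len(X)  # both uses of W
            b -= lr * dA.mean(axis=0)
            c -= lr * err.mean(axis=0)
        return (W, b), np.tanh(X @ W + b)

    # Stack layers greedily: each layer sees only its own reconstruction loss,
    # never a global error signal propagated from the top of the network.
    X = rng.normal(size=(256, 64))      # stand-in for unlabeled training data
    pretrained = []
    H = X
    for n_hidden in (32, 16):
        layer, H = train_autoencoder_layer(H, n_hidden)
        pretrained.append(layer)
    # `pretrained` can then initialize a deep network fine-tuned with labels.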
To summarize, some of the challenges we view as important for future
breakthroughs in deep learning are the following:

• How should we deal with the fundamental challenges behind unsupervised
learning, such as intractable inference and sampling? See Chapters 15, 16, and
17, as well as the toy example following this list.

• How can we build and train much larger, more adaptive, and reconfigurable deep
architectures, thus maximizing the advantage one can draw from larger datasets?
See Chapter 8.

• How can we improve the ability of deep learning algorithms to disentangle the
underlying factors of variation, or, put more simply, to make sense of the world
around us? See Chapter 14 on this very basic question about what is involved in
learning a good representation.
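To give a feel for why inference can be intractable, here is a toy example of
our own (not taken from the book): computing the exact normalization constant of
a small energy-based model over binary variables requires summing over every
configuration, and the number of configurations grows exponentially with the
number of variables. The model and its weights are illustrative assumptions.

    # Toy illustration of intractable inference: the exact partition function
    # of an energy-based model over n binary units sums over 2**n states.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 16                              # already 65,536 configurations
    W = np.triu(rng.normal(scale=0.1, size=(n, n)), k=1)  # pairwise weights

    def energy(x):
        return -(x @ W @ x)             # lower energy = more probable

    Z = sum(np.exp(-energy(np.array(cfg, dtype=float)))
            for cfg in itertools.product((0, 1), repeat=n))
    print(Z)
    # The number of terms doubles with every added variable; at n = 60 this
    # brute-force sum is hopeless, which motivates the approximate inference
    # and sampling methods discussed in Chapters 15, 16, and 17.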
Many other challenges are not discussed in this book, such as the needed integration of
deep learning with reinforcement learning, active learning, and reasoning.