Chapter 1
Introduction
Inventors have long dreamed of creating machines that think. Ancient Greek myths tell
of intelligent objects, such as animated statues of human beings and tables that arrive
full of food and drink when called.
When programmable computers were first conceived, over a hundred years before one
was built, people wondered whether such machines might become intelligent (Lovelace, 1842).
Today, artificial intelligence (AI) is a thriving field with many practical applications
and active research topics. We look to intelligent software to automate routine labor,
understand speech or images, make diagnoses in medicine, and support basic scientific
research.
In the early days of artificial intelligence, the field rapidly tackled and solved prob-
lems that are intellectually difficult for human beings but relatively straightforward for
computers—problems that can be described by a list of formal, mathematical rules. The
true challenge to artificial intelligence proved to be solving the tasks that are easy for
people to perform but hard for people to describe formally— problems that we solve
intuitively, that feel automatic, like recognizing spoken words or faces in images.
This book is about a solution to these more intuitive problems. This solution is
to allow computers to learn from experience and understand the world in terms of
a hierarchy of concepts, with each concept defined in terms of its relation to simpler
concepts. By gathering knowledge from experience, this approach avoids the need for
human operators to formally specify all of the knowledge that the computer needs. The
hierarchy of concepts allows the computer to learn complicated concepts by building
them out of simpler ones. If we draw a graph showing how these concepts are built on
top of each other, the graph is deep, with many layers. For this reason, we call this
approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal en-
vironments and did not require computers to have much knowledge about the world.
For example, IBM’s Deep Blue chess-playing system defeated world champion Garry
Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only
sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed
ways. Devising a successful chess strategy is a tremendous accomplishment, but the
challenge is not due to the difficulty of describing the relevant concepts to the com-
puter. Chess can be completely described by a very brief list of completely formal rules,
easily provided ahead of time by the programmer.
Ironically, abstract and formal tasks such as chess that are among the most difficult
mental undertakings for a human being are among the easiest for a computer. A person’s
everyday life requires an immense amount of knowledge about the world, and much
of this knowledge is subjective and intuitive, and therefore difficult to articulate in a
formal way. Computers need to capture this same knowledge in order to behave in an
intelligent way. One of the key challenges in artificial intelligence is how to get this
informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. This is known as the knowledge
base approach to artificial intelligence. None of these projects has led to a major
success. One of the most famous such projects is Cyc.¹ Cyc (Lenat and Guha, 1989)
is an inference engine and a database of statements in a language called CycL. These
statements are entered by a staff of human supervisors. It is an unwieldy process. People
struggle to devise formal rules with enough complexity to accurately describe the world.
For example, Cyc failed to understand a story about a person named Fred shaving in the
morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it
knew that people do not have electrical parts, but because Fred was holding an electric
razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore
asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning. The introduction of machine
learning allowed computers to tackle problems involving knowledge of the real world
and make decisions that appear subjective. A simple machine learning algorithm called
logistic regression² can determine whether to recommend cesarean delivery (Mor-Yosef
et al., 1990). A simple machine learning algorithm called naive Bayes can separate
legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily on
the representation of the data they are given. For example, when logistic regression
is used to recommend cesarean delivery, the AI system does not examine the patient
directly. Instead, the doctor tells the system several pieces of relevant information, such
as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each
of these features of the patient correlates with various outcomes. However, it cannot
¹ http://www.amazon.com/Building-Large-Knowledge-Based-Systems-Representation/dp/0201517523
² Logistic regression was developed in statistics to generalize linear regression to the prediction of the
conditional probability of categorical variables. It can be viewed as a neural network with no hidden
layer, trained for classification of labels y given inputs x with the conditional log-likelihood criterion
log P(y | x). Note how very similar algorithms, such as logistic regression, have been developed in
parallel in the machine learning community and in the statistics community, often not using the same
language (Breiman, 2001).
Figure 1.1: Example of different representations: suppose we want to separate two
categories of data by drawing a line between them in a scatterplot. In the plot on the
left, we represent some data using Cartesian coordinates, and the task is impossible. In
the plot on the right, we represent the same data with polar coordinates, and the task becomes
simple to solve with a vertical line. (Figure credit: David Warde-Farley)
influence the way that the features are defined in any way. If logistic regression were
given a 3-D MRI image of the patient, rather than the doctor’s formalized report, it
would not be able to make useful predictions. Individual voxels³ in an MRI scan have
negligible correlation with any complications that might occur during delivery.
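To make the preceding discussion concrete, the sketch below trains a logistic regression model by gradient ascent on the conditional log-likelihood log P(y | x) described in footnote 2. The feature values and outcomes are entirely hypothetical stand-ins for the doctor-provided features discussed above, not data from the cited study.

import numpy as np

# Hypothetical feature vectors: each row is one patient, each column a
# hand-coded feature supplied by the doctor (e.g. uterine scar present).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])  # hypothetical outcomes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

# Gradient ascent on the conditional log-likelihood, the training
# criterion mentioned in footnote 2.
for step in range(1000):
    p = sigmoid(X @ w + b)       # P(y = 1 | x) under the current weights
    w += lr * (X.T @ (y - p))
    b += lr * np.sum(y - p)

print(sigmoid(X @ w + b))        # learned probability of the outcome for each patient

The model can only weigh the features it is given; as the text goes on to note, it has no way to influence how those features are defined.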
This dependence on representations is a general phenomenon that appears through-
out computer science and even daily life. In computer science, operations such as search-
ing a collection of data can proceed exponentially faster if the collection is structured
and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but
find arithmetic on Roman numerals much more time consuming. It is not surprising
that the choice of representation has an enormous effect on the performance of machine
learning algorithms. For a simple visual example, see Fig. 1.1.
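A minimal sketch of the representation change in Fig. 1.1, assuming synthetic data drawn from two concentric rings: in Cartesian coordinates no vertical line separates the two categories, but after converting to polar coordinates a single threshold on the radius does.

import numpy as np

rng = np.random.default_rng(0)

# Two synthetic categories: points near radius 1 and points near radius 3.
def ring(radius, n):
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + 0.1 * rng.normal(size=n)
    return np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

inner, outer = ring(1.0, 200), ring(3.0, 200)

# Polar representation: the radius alone separates the categories.
def radius(xy):
    return np.hypot(xy[:, 0], xy[:, 1])

threshold = 2.0
accuracy = np.mean(np.concatenate([radius(inner) < threshold,
                                   radius(outer) >= threshold]))
print(accuracy)  # close to 1.0 once the data is represented in polar coordinates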
Many artificial intelligence tasks can be solved by designing the right set of features
to extract for that task, then providing these features to a simple machine learning
algorithm. For example, a useful feature for speaker identification from sound is the
pitch. The pitch can be formally specified—it is the lowest frequency major peak of the
spectrogram. It is useful for speaker identification because it is determined by the size
of the vocal tract, and therefore gives a strong clue as to whether the speaker is a man,
woman, or child.
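As a rough illustration of such a hand-designed feature, the sketch below estimates a signal's pitch as the lowest strong peak of its magnitude spectrum, using plain numpy on a synthetic tone. A practical pitch tracker would need to handle windowing, harmonics, and voicing detection far more carefully; this is only a toy version of the idea.

import numpy as np

def estimate_pitch(signal, sample_rate, threshold=0.6):
    """Crude pitch estimate: the lowest frequency whose spectral magnitude
    exceeds a fraction of the spectrum's maximum. Illustrative only."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    strong = np.flatnonzero(spectrum > threshold * spectrum.max())
    return freqs[strong[0]]  # lowest "major" peak

# A synthetic voiced sound: a 120 Hz fundamental plus one weaker harmonic.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
print(estimate_pitch(tone, sr))  # approximately 120.0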
However, for many tasks, it is difficult to know what features should be extracted. For
example, suppose that we would like to write a program to detect cars in photographs.
We know that cars have wheels, so we might like to use the presence of a wheel as a
feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms
³ A voxel is the value at a single point in a 3-D scan, much as a pixel is the value at a single point
in an image.
of pixel values. A wheel has a simple geometric shape but its image may be complicated
by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the
fender of the car or an object in the foreground obscuring part of the wheel, and so on.
One solution to this problem is to use machine learning to discover not only the map-
ping from representation to output but also the representation itself. This approach is
known as representation learning. Learned representations often result in much better
performance than can be obtained with hand-designed representations. They also al-
low AI systems to rapidly adapt to new tasks, with minimal human intervention. A
representation learning algorithm can discover a good set of features for a simple task
in minutes, or a complex task in hours to months. Manually designing features for a
complex task requires a great deal of human time and effort; it can take decades for an
entire community of researchers.
The quintessential example of a representation learning algorithm is the autoencoder.
An autoencoder is the combination of an encoder function that converts the input data
into a different representation, and a decoder function that converts the new represen-
tation back into the original format. Autoencoders are trained to preserve as much
information as possible when an input is run through the encoder and then the de-
coder, but are also trained to make the new representation have various nice properties.
(Different kinds of autoencoders aim to achieve different kinds of properties.)
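The sketch below shows this encoder/decoder structure in its simplest possible form: a linear encoder and decoder trained by gradient descent to minimize squared reconstruction error on toy data. Practical autoencoders add nonlinearities and extra training criteria to obtain the useful properties mentioned above; nothing here is specific to any particular published model.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 10 dimensions that lie near a 3-D subspace.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(100, 10))

W_enc = 0.1 * rng.normal(size=(10, 3))  # encoder: input -> new representation
W_dec = 0.1 * rng.normal(size=(3, 10))  # decoder: representation -> original format
lr = 0.01

for step in range(2000):
    code = X @ W_enc           # the new representation of the input
    recon = code @ W_dec       # converted back into the original format
    err = recon - X            # trained to preserve as much information as possible
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean((X @ W_enc @ W_dec - X) ** 2))  # mean squared reconstruction error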
When designing features or algorithms for learning features, our goal is usually to
separate the factors of variation that explain the observed data. (In this context, we use
the word “factors” simply to refer to separate sources of influence; the factors are usually
not combined by multiplication.) Such factors are often not quantities that are directly
observed but they exist either as unobserved objects or forces in the physical world that
affect observable quantities, or they are constructs in the human mind that provide useful
simplifying explanations or inferred causes of the observed data. They can be thought
of as concepts or abstractions that help us make sense of the rich variability in the data.
When analyzing a speech recording, the factors of variation include the speaker’s age
and sex, their accent, and the words that they are speaking. When analyzing an image
of a car, the factors of variation include the position of the car, its color, and the angle
and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is
that many of the factors of variation influence every single piece of data we are able to
observe. The individual pixels in an image of a red car might be very close to black at
night. The shape of the car’s silhouette depends on the viewing angle. Most applications
require us to disentangle the factors of variation and discard the ones that we do not
care about.
Of course, it can be very difficult to extract such high-level, abstract features from
raw data. Many of these factors of variation, such as a speaker’s accent, can only be
identified using sophisticated, nearly human-level understanding of the data. When it
is nearly as difficult to obtain a representation as to solve the original problem, repre-
sentation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2
shows how a deep learning system can represent the concept of an image of a person by
combining simpler concepts, such as corners and contours, which are in turn defined in
terms of edges.
The quintessential example of a deep learning model is the multilayer perceptron
(MLP). A multilayer perceptron is just a mathematical function mapping some set
of input values to output values. The function is formed by composing many simpler
functions. We can think of each application of a different mathematical function as
providing a new representation of the input.
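A minimal sketch of that composition, assuming two hidden layers with random, untrained weights and a rectified linear nonlinearity chosen purely for illustration. The point is only the structure: the output is a chain of simpler functions applied to the input, and each intermediate result is a new representation.

import numpy as np

rng = np.random.default_rng(0)

def layer(W, b):
    """One simple function: an affine map followed by a nonlinearity."""
    return lambda h: np.maximum(0.0, h @ W + b)

f1 = layer(rng.normal(size=(4, 8)), np.zeros(8))     # first new representation
f2 = layer(rng.normal(size=(8, 8)), np.zeros(8))     # second new representation
W_out, b_out = rng.normal(size=(8, 2)), np.zeros(2)  # final mapping to outputs

def mlp(x):
    return f2(f1(x)) @ W_out + b_out  # the MLP is just a composed function

x = rng.normal(size=(1, 4))
print(mlp(x))  # output values for one input vector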
The idea of learning the right representation for the data provides one perspective
on deep learning. Another perspective on deep learning is that it allows the computer to
learn a multi-step computer program. Each layer of the representation can be thought
of as the state of the computer’s memory after executing another set of instructions in
parallel. Networks with greater depth can execute more instructions in sequence. Being
able to execute instructions sequentially offers great power because later instructions can
refer back to the results of earlier instructions. According to this view of deep learning,
not all of the information in a layer’s representation of the input necessarily encodes
factors of variation that explain the input. The representation is also used to store state
information that helps to execute a program that can make sense of the input. This
state information could be analogous to a counter or pointer in a traditional computer
program. It has nothing to do with the content of the input specifically, but it helps the
model to organize its processing.
“Depth” is not a mathematically rigorous term in this context; there is no formal
definition of deep learning and no generally accepted convention for measuring the depth
of a particular model. All approaches to deep learning share the idea of nested repre-
sentations of data, but different approaches view depth in different ways. For some
approaches, the depth of the system is the depth of the flowchart describing the com-
putations needed to produce the final representation. The depth corresponds roughly
to the number of times we update the representation (and of course, what one person
considers to be a single complex update, another person may consider to be multiple
simple updates, so even two people using this same basic approach to defining depth
may not agree on the exact number of layers present in a model). Other approaches
consider depth to be the depth of the graph describing how concepts are related to each
other. In this case, the depth of the flowchart of the computations needed to compute
the representation of each concept may be much deeper than the graph of the concepts
themselves. This is because the system’s understanding of the simpler concepts can be
refined given information about the more complex concepts. For example, an AI system
observing an image of a face with one eye in shadow may initially only see one eye.
After detecting that a face is present, it can then infer that a second eye is probably
present as well. In this case, the graph of concepts only includes two layers (a layer for
eyes and a layer for faces), but the graph of computations includes 2n layers if we refine
our estimate of each concept given the other n times.
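The schematic sketch below illustrates only this counting argument: two concept estimates, "eye" and "face", are alternately refined given each other, so a two-layer concept graph unrolls into a computation graph whose depth grows as 2n. The update rules are invented purely for illustration.

# Schematic only: two concepts, "eye" and "face", each refined n times
# given the current estimate of the other.
def refine(eye_evidence, face_evidence, n=3):
    eye, face = eye_evidence, face_evidence
    computation_layers = 0
    for _ in range(n):
        face = 0.5 * face + 0.5 * eye   # faces provide evidence of eyes, and vice versa
        computation_layers += 1
        eye = 0.5 * eye + 0.5 * face
        computation_layers += 1
    return eye, face, computation_layers

# One eye is in shadow: weak eye evidence, moderate face evidence.
eye, face, depth = refine(eye_evidence=0.2, face_evidence=0.6)
print(depth)  # 2n computation layers for a concept graph with only 2 layers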
To summarize, deep learning, the subject of this book, is an approach to AI. Specif-
ically, it is a type of machine learning, a technique that allows computer systems to
[Figure 1.2 image omitted. Panels, from bottom to top: visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); output (object identity: CAR, PERSON, ANIMAL). Feature visualizations reproduced from Zeiler and Fergus (2014).]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to un-
derstand the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity
is very complicated. Learning or evaluating this mapping seems insurmountable if tack-
led directly. Deep learning resolves this difficulty by breaking the desired complicated
mapping into a series of nested simple mappings, each described by a different layer of
the model. The input is presented at the visible layer, so named because it contains the
variables that we are able to observe. Then a series of hidden layers extracts increasingly
abstract features from the image. These layers are called “hidden” because their values
are not given in the data; instead the model must determine which concepts are useful
for explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer
can easily identify edges, by comparing the brightness of neighboring pixels. Given the
first hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the third
hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image. Images reproduced
with permission from Zeiler and Fergus (2014).
[Figure 1.3 diagram omitted. Nested regions, from outermost to innermost: AI (example: knowledge bases), machine learning (example: logistic regression), representation learning (example: autoencoders), deep learning (example: MLPs).]
Figure 1.3: A Venn diagram showing how deep learning is a kind of representation
learning, which is in turn a kind of machine learning, which is used for many but not
all approaches to AI. Each section of the Venn diagram includes an example of an AI
technology.
improve with experience and data. According to the authors of this book, machine
learning is the only viable approach to building AI systems that can operate in compli-
cated, real-world environments. Deep learning is a particular kind of machine learning
that achieves great power and flexibility by learning to represent the world as a nested
hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.3
illustrates the relationship between these different AI disciplines. Fig. 1.4 gives a high-
level schematic of how each works.
1.1 Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target
audiences in mind. One of these target audiences is university students (undergraduate
or graduate) learning about machine learning, including those who are beginning a
[Figure 1.4 flowcharts omitted. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → features → mapping from features → output. Deep learning: input → simplest features → additional layers of increasingly complex features → mapping from features → output.]
Figure 1.4: Flow-charts showing how the different parts of an AI system relate to each
other within different AI disciplines. Shaded boxes indicate components that are able
to learn from data.
career in deep learning and artificial intelligence research. The other target audience
is software engineers who do not have a machine learning or statistics background, but
want to rapidly acquire one and begin using deep learning in their product or platform.
Software engineers working in a wide variety of industries are likely to find deep learning
to be useful, as it has already proven successful in many areas including computer vision,
speech and audio processing, natural language processing, robotics, bioinformatics and
chemistry, video games, search engines, online advertising, and finance.
This book has been organized into three parts in order to best accommodate a variety
of readers. Part 1 introduces basic mathematical tools and machine learning concepts.
Part 2 describes the most established deep learning algorithms that are essentially solved
technologies. Part 3 describes more speculative ideas that are widely believed to be
important for future research in deep learning.
Readers should feel free to skip parts that are not relevant given their interests or
background. Readers familiar with linear algebra, probability, and fundamental machine
learning concepts can skip part 1, for example, while readers who just want to implement
a working system need not read beyond part 2.
We do assume that all readers come from a computer science background. We assume
familiarity with programming, a basic understanding of computational performance is-
sues, complexity theory, introductory level calculus, and some of the terminology of
graph theory.
1.2 Historical Trends in Deep Learning
While the term “deep learning” is relatively new, the field dates back to the 1950s. The
field has been rebranded many times, reflecting the influence of different researchers and
different perspectives. Previous names include “artificial neural networks,” “parallel
distributed processing,” and “connectionism.”
One important perspective in the history of deep learning is the idea that artificial
intelligence should draw inspiration from the brain (whether the human brain or the
brains of animals). This perspective gave rise to the “neural network” terminology.
Unfortunately, we know extremely little about the brain. The brain contains billions
of neurons with tens of thousands of connections between neurons. We are not yet
able to accurately record the individual activities of more than a handful of neurons
simultaneously. Consequently, we do not have the right kind of data to reverse engineer
the algorithms used by the brain. Deep learning algorithms resemble the brain insofar
as both the brain and deep learning models involve a very large number of computation
units that are not especially intelligent in isolation but become intelligent when they
interact with each other. Beyond that, it is difficult to say how similar the two are;
they are unlikely to have many other similarities, and our knowledge of the brain does
not give very specific guidance for improving deep learning. For these reasons, modern
terminology no longer emphasizes the biological inspiration of deep learning algorithms.
Deep learning has now drawn useful insights from many fields other than
neuroscience, including structured probabilistic models and manifold learning, and the
modern terminology aims to avoid implying that only one field has inspired the current
algorithms.
One may wonder why deep learning has only recently become recognized as a crucial
technology if it has existed since the 1950s. Deep learning has been successfully used
in commercial applications since the 1990s, but until recently it was often regarded as
being more of an art than a technology, something that only an expert could use. It is
true that some skill is required to get good performance from a deep learning algorithm.
Fortunately, the amount of skill required decreases as the amount of training data and the
size of the model increase. In the age of “Big Data” we now have large enough training
sets to make deep learning algorithms consistently perform well (see Fig. 1.5), and fast
enough CPUs or GPUs and enough memory to train very large models (see Fig. 1.6 and
Fig. 1.7). The algorithms reaching human performance on complex tasks today are very
similar to the algorithms that struggled to solve toy problems in the 1980s—the most
important difference is that today we can provide these algorithms with the resources
they need to succeed.
The earliest predecessors of modern deep learning were simple linear models moti-
vated from a neuroscientific perspective. These models took a vector of n input values x
and computed a simple function f(x) = \sum_{i=1}^{n} w_i x_i using a vector of learned “weights”
w. The Perceptron (Rosenblatt, 1958, 1962) could recognize two different categories of
inputs by testing whether f(x) is positive or negative. The Adaptive Linear Element
(ADALINE) simply returned the value of f(x) itself to predict a real number (Widrow
and Hoff, 1960).
These simple learning algorithms greatly affected the modern landscape of machine
learning. ADALINE can be seen as training a linear regression model with the stochastic
gradient descent algorithm, which is still used, with only slight modification, in state-of-
the-art deep learning algorithms today. Linear regression itself remains in use in cases
where we prefer the speed or interpretability of the model over the ability to fit complex
training sets, or where we have too little training data, or too noisy a relationship between
inputs and outputs, to fit a more complicated model.
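A minimal sketch of these early linear models on synthetic data: the learned function is f(x) = \sum_{i=1}^{n} w_i x_i, ADALINE-style training adjusts w by stochastic gradient descent on the squared error, and a perceptron-style classifier simply thresholds the same linear function. The data and hyperparameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: real-valued targets generated by a known linear function.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(2)
lr = 0.01

# ADALINE-style training: stochastic gradient descent on the squared error,
# visiting one example at a time.
for epoch in range(20):
    for i in rng.permutation(len(X)):
        f = X[i] @ w                    # f(x) = sum_i w_i x_i
        w -= lr * (f - y[i]) * X[i]     # SGD step on (f - y)^2 / 2

print(w)  # approximately [2.0, -1.0]

# A perceptron-style classifier recognizes two categories by taking the sign
# of the same linear function.
labels = np.sign(X @ w)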
Unfortunately, the limitations of these linear models led to a backlash against bio-
logically inspired machine learning in general (Minsky and Papert, 1969), and other ap-
proaches dominated AI until the early 1980s. In the mid-1980s, the back-propagation
algorithm enabled the extension of biologically-inspired machine learning approaches
to more complex models that incorporated non-linear behavior via the introduction of
hidden layers (Rumelhart et al., 1986a; LeCun, 1987). Neural networks became popular
again and remained so until the mid-1990s, when their popularity declined once more.
This was in part due to a negative reaction to the failure of neural networks
to fulfill excessive promises made by a variety of people seeking investment in neural
network-based ventures, but also due to improvements in other fields of machine learn-
ing that were more amenable to theoretical analysis. Kernel machines (Boser et al.,
1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan,
1998) became the main focus of academic study, while hand-designing domain-specific
features became the typical approach to practical applications. During this time, neu-
ral networks continued to obtain impressive performance on some tasks (LeCun et al.,
[Figure 1.5 plot omitted. Axes: year (logarithmic scale) versus dataset size in number of examples (logarithmic scale). Datasets plotted: Criminals, Rotated T vs C, T vs G vs F, Iris, MNIST, Public SVHN, CIFAR-10, ImageNet, ImageNet10k, ILSVRC 2014, and Sports-1M.]
Figure 1.5: Dataset sizes have increased greatly over time. In the early 1900s, statis-
ticians studied datasets using hundreds or thousands of manually compiled measure-
ments (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through
1980s, the pioneers of biologically-inspired machine learning often worked with small,
synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur
low computational cost and demonstrate that neural networks were able to learn specific
kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and
1990s, machine learning became more statistical in nature and began to leverage larger
datasets containing tens of thousands of examples such as the MNIST dataset of scans
of handwritten numbers (LeCun et al., 1998a). In the first decade of the 2000s, more
sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and
Hinton, 2009) continued to be produced. Toward the end of that decade and throughout
the first half of the 2010s, significantly larger datasets, containing hundreds of thousands
to tens of millions of examples, completely changed what was possible with deep learn-
ing. These datasets included the public Street View House Numbers dataset(Netzer
et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010; Rus-
sakovsky et al., 2014), and the Sports-1M dataset (Karpathy et al., 2014). Deep learning
methods so far require large, labeled datasets to succeed. As of 2015, a rough rule of
thumb is that a supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or exceed
human performance when trained with a dataset containing at least 10 million labeled
examples.
[Figure 1.6 plot omitted. Axes: year (logarithmic scale) versus connections per neuron (logarithmic scale). Horizontal reference lines mark the fruit fly, mouse, cat, and human; numbered points 1–10 correspond to the networks listed after the caption.]
Figure 1.6: Initially, the number of connections between neurons in artificial neural net-
works was limited by hardware capabilities. Today, the number of connections between
neurons is mostly a design consideration. Some artificial neural networks have nearly as
many connections per neuron as a cat, and it is quite common for other neural networks
to have as many connections per neuron as smaller mammals like mice. Even the hu-
man brain does not have an exorbitant number of connections per neuron. The sparse
connectivity of biological neural networks means that our artificial networks are able to
match the performance of biological neural networks despite limited hardware. Modern
neural networks are much smaller than the brains of any vertebrate animal, but we typ-
ically train each network to perform just one task, while an animal’s brain has different
areas devoted to different tasks. Biological neural network sizes from Wikipedia (2015).
1. Adaptive Linear Element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009a)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014)
1998a; Bengio et al., 2001a).
Research groups led by Geoffrey Hinton at the University of Toronto, Yoshua Bengio
at the University of Montreal, and Yann LeCun at New York University re-popularized
neural networks, re-branded as “deep learning,” beginning in 2006. At this time, it was
believed that the primary difficulty in using deep learning was optimizing the non-convex
functions involved in neural network training. Until approximately 2012, most research
in deep learning focused on using unsupervised learning to “pretrain” each layer of the
network in isolation, so that the final supervised training stage would not need to modify
the network’s parameters greatly (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al.,
2007).
Since 2012, the greatest successes in deep learning have come not from this layerwise
pretraining scheme but simply from applying traditional supervised learning techniques
to large models on large datasets. Deep architectures had been proposed well before this
point (Fukushima, 1980; LeCun et al., 1989; Schmidhuber, 1992; Utgoff and Stracuzzi, 2002),
but attempts to jointly train a deep neural network with many layers had not met with major
success, except to some extent in the case of convolutional architectures (LeCun et al., 1989,
1998c).
[Figure 1.7 plot omitted. Axes: year (logarithmic scale) versus number of neurons (logarithmic scale). Horizontal reference lines mark the sponge, roundworm, leech, ant, bee, frog, octopus, and human; numbered points 1–20 correspond to the networks listed after the caption.]
Figure 1.7: Since the introduction of hidden units, artificial neural networks have dou-
bled in size roughly every 2.4 years. This growth is driven primarily by faster computers
with larger memory, but also by the availability of larger datasets. These larger net-
works are able to achieve higher accuracy on more complex tasks. This trend looks
set to continue for many years. Unless new technologies allow faster scaling, artificial
neural networks will not have the same number of neurons as the human brain until
2056. Real biological neurons are likely to represent more complicated functions than
current artificial neurons, so biological neural networks may be even larger than this
plot portrays. Biological neural network sizes from Wikipedia (2015).
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive Linear Element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early backpropagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009a)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014)