Chapter 1
Introduction
Inventors have long dreamed of creating machines that think. Ancient Greek myths tell
of intelligent objects, such as animated statues of human beings and tables that arrive
full of food and drink when called.
When programmable computers were first conceived, over a hundred years before one
was built, people wondered whether such machines might become intelligent (Lovelace, 1842).
Today, artificial intelligence (AI) is a thriving field with many practical applications
and active research topics. We look to intelligent software to automate routine labor,
understand speech or images, make diagnoses in medicine, and support basic scientific
research.
In the early days of artificial intelligence, the field rapidly tackled and solved prob-
lems that are intellectually difficult for human beings but relatively straightforward for
computers—problems that can be described by a list of formal, mathematical rules. The
true challenge to artificial intelligence proved to be solving the tasks that are easy for
people to perform but hard for people to describe formally— problems that we solve
intuitively, that feel automatic, like recognizing spoken words or faces in images.
This book is about a solution to these more intuitive problems. This solution is
to allow computers to learn from experience and understand the world in terms of
a hierarchy of concepts, with each concept defined in terms of its relation to simpler
concepts. By gathering knowledge from experience, this approach avoids the need for
human operators to formally specify all of the knowledge that the computer needs. The
hierarchy of concepts allows the computer to learn complicated concepts by building
them out of simpler ones. If we draw a graph showing how these concepts are built on
top of each other, the graph is deep, with many layers. For this reason, we call this
approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal en-
vironments and did not require computers to have much knowledge about the world.
For example, IBM’s Deep Blue chess-playing system defeated world champion Garry
Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only
sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed
ways. Devising a successful chess strategy is a tremendous accomplishment, but the
challenge is not due to the difficulty of describing the relevant concepts to the com-
puter. Chess can be completely described by a very brief list of completely formal rules,
easily provided ahead of time by the programmer.
Ironically, abstract and formal tasks such as chess that are among the most difficult
mental undertakings for a human being are among the easiest for a computer. A person’s
everyday life requires an immense amount of knowledge about the world, and much
of this knowledge is subjective and intuitive, and therefore difficult to articulate in a
formal way. Computers need to capture this same knowledge in order to behave in an
intelligent way. One of the key challenges in artificial intelligence is how to get this
informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. This is known as the knowledge
base approach to artificial intelligence. None of these projects has led to a major
success. One of the most famous such projects is Cyc.¹ Cyc (Lenat and Guha, 1989)
is an inference engine and a database of statements in a language called CycL. These
statements are entered by a staff of human supervisors. It is an unwieldy process. People
struggle to devise formal rules with enough complexity to accurately describe the world.
For example, Cyc failed to understand a story about a person named Fred shaving in the
morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it
knew that people do not have electrical parts, but because Fred was holding an electric
razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore
asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning. The introduction of machine
learning allowed computers to tackle problems involving knowledge of the real world
and make decisions that appear subjective. A simple machine learning algorithm called
logistic regression² can determine whether to recommend cesarean delivery (Mor-Yosef
et al., 1990). A simple machine learning algorithm called naive Bayes can separate
legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily on
the representation of the data they are given. For example, when logistic regression
is used to recommend cesarean delivery, the AI system does not examine the patient
directly. Instead, the doctor tells the system several pieces of relevant information, such
as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each
of these features of the patient correlates with various outcomes. However, it cannot
¹ http://www.amazon.com/Building-Large-Knowledge-Based-Systems-Representation/dp/0201517523
² Logistic regression was developed in statistics to generalize linear regression to the prediction of the
conditional probability of categorical variables. It can be viewed as a neural network with no hidden
layer, trained for classification of labels y given inputs x with the conditional log-likelihood criterion
log P(y | x). Note how very similar algorithms, such as logistic regression, have been developed in
parallel in the machine learning community and in the statistics community, often not using the same
language (Breiman, 2001).
Figure 1.1: Example of different representations: suppose we want to separate two
categories of data by drawing a line between them in a scatterplot. In the plot on the
left, we represent some data using Cartesian coordinates, and the task is impossible. In
the plot on the right, we represent the same data with polar coordinates, and the task becomes
simple to solve with a vertical line. (Figure credit: David Warde-Farley)
influence the way that the features are defined in any way. If logistic regression were
given a 3-D MRI image of the patient, rather than the doctor’s formalized report, it
would not be able to make useful predictions. Individual voxels³ in an MRI scan have
negligible correlation with any complications that might occur during delivery.
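To make the preceding discussion concrete, the sketch below trains a logistic regression model by gradient ascent on the conditional log-likelihood log P(y | x) described in footnote 2. The feature values and outcomes are entirely hypothetical stand-ins for the doctor-provided features discussed above, not data from the cited study.

import numpy as np

# Hypothetical feature vectors: each row is one patient, each column a
# hand-coded feature supplied by the doctor (e.g. uterine scar present).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])  # hypothetical outcomes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

# Gradient ascent on the conditional log-likelihood, the training
# criterion mentioned in footnote 2.
for step in range(1000):
    p = sigmoid(X @ w + b)       # P(y = 1 | x) under the current weights
    w += lr * (X.T @ (y - p))
    b += lr * np.sum(y - p)

print(sigmoid(X @ w + b))        # learned probability of the outcome for each patient

The model can only weigh the features it is given; as the text goes on to note, it has no way to influence how those features are defined.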
This dependence on representations is a general phenomenon that appears through-
out computer science and even daily life. In computer science, operations such as search-
ing a collection of data can proceed exponentially faster if the collection is structured
and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but
find arithmetic on Roman numerals much more time consuming. It is not surprising
that the choice of representation has an enormous effect on the performance of machine
learning algorithms. For a simple visual example, see Fig. 1.1.
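A minimal sketch of the representation change in Fig. 1.1, assuming synthetic data drawn from two concentric rings: in Cartesian coordinates no vertical line separates the two categories, but after converting to polar coordinates a single threshold on the radius does.

import numpy as np

rng = np.random.default_rng(0)

# Two synthetic categories: points near radius 1 and points near radius 3.
def ring(radius, n):
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + 0.1 * rng.normal(size=n)
    return np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)

inner, outer = ring(1.0, 200), ring(3.0, 200)

# Polar representation: the radius alone separates the categories.
def radius(xy):
    return np.hypot(xy[:, 0], xy[:, 1])

threshold = 2.0
accuracy = np.mean(np.concatenate([radius(inner) < threshold,
                                   radius(outer) >= threshold]))
print(accuracy)  # close to 1.0 once the data is represented in polar coordinates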
Many artificial intelligence tasks can be solved by designing the right set of features
to extract for that task, then providing these features to a simple machine learning
algorithm. For example, a useful feature for speaker identification from sound is the
pitch. The pitch can be formally specified—it is the lowest frequency major peak of the
spectrogram. It is useful for speaker identification because it is determined by the size
of the vocal tract, and therefore gives a strong clue as to whether the speaker is a man,
woman, or child.
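As a rough illustration of such a hand-designed feature, the sketch below estimates a signal's pitch as the lowest strong peak of its magnitude spectrum, using plain numpy on a synthetic tone. A practical pitch tracker would need to handle windowing, harmonics, and voicing detection far more carefully; this is only a toy version of the idea.

import numpy as np

def estimate_pitch(signal, sample_rate, threshold=0.6):
    """Crude pitch estimate: the lowest frequency whose spectral magnitude
    exceeds a fraction of the spectrum's maximum. Illustrative only."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    strong = np.flatnonzero(spectrum > threshold * spectrum.max())
    return freqs[strong[0]]  # lowest "major" peak

# A synthetic voiced sound: a 120 Hz fundamental plus one weaker harmonic.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
print(estimate_pitch(tone, sr))  # approximately 120.0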
However, for many tasks, it is difficult to know what features should be extracted. For
example, suppose that we would like to write a program to detect cars in photographs.
We know that cars have wheels, so we might like to use the presence of a wheel as a
feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms
³ A voxel is the value at a single point in a 3-D scan, much as a pixel is the value at a single point
in an image.
of pixel values. A wheel has a simple geometric shape but its image may be complicated
by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the
fender of the car or an object in the foreground obscuring part of the wheel, and so on.
One solution to this problem is to use machine learning to discover not only the map-
ping from representation to output but also the representation itself. This approach is
known as representation learning. Learned representations often result in much better
performance than can be obtained with hand-designed representations. They also al-
low AI systems to rapidly adapt to new tasks, with minimal human intervention. A
representation learning algorithm can discover a good set of features for a simple task
in minutes, or a complex task in hours to months. Manually designing features for a
complex task requires a great deal of human time and effort; it can take decades for an
entire community of researchers.
The quintessential example of a representation learning algorithm is the autoencoder.
An autoencoder is the combination of an encoder function that converts the input data
into a different representation, and a decoder function that converts the new represen-
tation back into the original format. Autoencoders are trained to preserve as much
information as possible when an input is run through the encoder and then the de-
coder, but are also trained to make the new representation have various nice properties.
(Different kinds of autoencoders aim to achieve different kinds of properties.)
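The sketch below shows this encoder/decoder structure in its simplest possible form: a linear encoder and decoder trained by gradient descent to minimize squared reconstruction error on toy data. Practical autoencoders add nonlinearities and extra training criteria to obtain the useful properties mentioned above; nothing here is specific to any particular published model.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 10 dimensions that lie near a 3-D subspace.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(100, 10))

W_enc = 0.1 * rng.normal(size=(10, 3))  # encoder: input -> new representation
W_dec = 0.1 * rng.normal(size=(3, 10))  # decoder: representation -> original format
lr = 0.01

for step in range(2000):
    code = X @ W_enc           # the new representation of the input
    recon = code @ W_dec       # converted back into the original format
    err = recon - X            # trained to preserve as much information as possible
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean((X @ W_enc @ W_dec - X) ** 2))  # mean squared reconstruction error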
When designing features or algorithms for learning features, our goal is usually to
separate the factors of variation that explain the observed data. (In this context, we use
the word “factors” simply to refer to separate sources of influence; the factors are usually
not combined by multiplication.) Such factors are often not quantities that are directly
observed but they exist either as unobserved objects or forces in the physical world that
affect observable quantities, or they are constructs in the human mind that provide useful
simplifying explanations or inferred causes of the observed data. They can be thought
of as concepts or abstractions that help us make sense of the rich variability in the data.
When analyzing a speech recording, the factors of variation include the speaker’s age
and sex, their accent, and the words that they are speaking. When analyzing an image
of a car, the factors of variation include the position of the car, its color, and the angle
and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is
that many of the factors of variation influence every single piece of data we are able to
observe. The individual pixels in an image of a red car might be very close to black at
night. The shape of the car’s silhouette depends on the viewing angle. Most applications
require us to disentangle the factors of variation and discard the ones that we do not
care about.
Of course, it can be very difficult to extract such high-level, abstract features from
raw data. Many of these factors of variation, such as a speaker’s accent, can only be
identified using sophisticated, nearly human-level understanding of the data. When it
is nearly as difficult to obtain a representation as to solve the original problem, repre-
sentation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2
shows how a deep learning system can represent the concept of an image of a person by
combining simpler concepts, such as corners and contours, which are in turn defined in
terms of edges.
The quintessential example of a deep learning model is the multilayer perceptron
(MLP). A multilayer perceptron is just a mathematical function mapping some set
of input values to output values. The function is formed by composing many simpler
functions. We can think of each application of a different mathematical function as
providing a new representation of the input.
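A minimal sketch of that composition, assuming two hidden layers with random, untrained weights and a rectified linear nonlinearity chosen purely for illustration. The point is only the structure: the output is a chain of simpler functions applied to the input, and each intermediate result is a new representation.

import numpy as np

rng = np.random.default_rng(0)

def layer(W, b):
    """One simple function: an affine map followed by a nonlinearity."""
    return lambda h: np.maximum(0.0, h @ W + b)

f1 = layer(rng.normal(size=(4, 8)), np.zeros(8))     # first new representation
f2 = layer(rng.normal(size=(8, 8)), np.zeros(8))     # second new representation
W_out, b_out = rng.normal(size=(8, 2)), np.zeros(2)  # final mapping to outputs

def mlp(x):
    return f2(f1(x)) @ W_out + b_out  # the MLP is just a composed function

x = rng.normal(size=(1, 4))
print(mlp(x))  # output values for one input vector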
The idea of learning the right representation for the data provides one perspective
on deep learning. Another perspective on deep learning is that it allows the computer to
learn a multi-step computer program. Each layer of the representation can be thought
of as the state of the computer’s memory after executing another set of instructions in
parallel. Networks with greater depth can execute more instructions in sequence. Being
able to execute instructions sequentially offers great power because later instructions can
refer back to the results of earlier instructions. According to this view of deep learning,
not all of the information in a layer’s representation of the input necessarily encodes
factors of variation that explain the input. The representation is also used to store state
information that helps to execute a program that can make sense of the input. This
state information could be analogous to a counter or pointer in a traditional computer
program. It has nothing to do with the content of the input specifically, but it helps the
model to organize its processing.
“Depth” is not a mathematically rigorous term in this context; there is no formal
definition of deep learning and no generally accepted convention for measuring the depth
of a particular model. All approaches to deep learning share the idea of nested repre-
sentations of data, but different approaches view depth in different ways. For some
approaches, the depth of the system is the depth of the flowchart describing the com-
putations needed to produce the final representation. The depth corresponds roughly
to the number of times we update the representation (and of course, what one person
considers to be a single complex update, another person may consider to be multiple
simple updates, so even two people using this same basic approach to defining depth
may not agree on the exact number of layers present in a model). Other approaches
consider depth to be the depth of the graph describing how concepts are related to each
other. In this case, the depth of the flowchart of the computations needed to compute
the representation of each concept may be much deeper than the graph of the concepts
themselves. This is because the system’s understanding of the simpler concepts can be
refined given information about the more complex concepts. For example, an AI system
observing an image of a face with one eye in shadow may initially only see one eye.
After detecting that a face is present, it can then infer that a second eye is probably
present as well. In this case, the graph of concepts only includes two layers (a layer for
eyes and a layer for faces), but the graph of computations includes 2n layers if we refine
our estimate of each concept given the other n times.
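The schematic sketch below illustrates only this counting argument: two concept estimates, "eye" and "face", are alternately refined given each other, so a two-layer concept graph unrolls into a computation graph whose depth grows as 2n. The update rules are invented purely for illustration.

# Schematic only: two concepts, "eye" and "face", each refined n times
# given the current estimate of the other.
def refine(eye_evidence, face_evidence, n=3):
    eye, face = eye_evidence, face_evidence
    computation_layers = 0
    for _ in range(n):
        face = 0.5 * face + 0.5 * eye   # faces provide evidence of eyes, and vice versa
        computation_layers += 1
        eye = 0.5 * eye + 0.5 * face
        computation_layers += 1
    return eye, face, computation_layers

# One eye is in shadow: weak eye evidence, moderate face evidence.
eye, face, depth = refine(eye_evidence=0.2, face_evidence=0.6)
print(depth)  # 2n computation layers for a concept graph with only 2 layers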
To summarize, deep learning, the subject of this book, is an approach to AI. Specif-
ically, it is a type of machine learning, a technique that allows computer systems to
[Figure 1.2 image omitted. Panels, from bottom to top: visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); output (object identity: CAR, PERSON, ANIMAL). Feature visualizations reproduced from Zeiler and Fergus (2014).]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to un-
derstand the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity
is very complicated. Learning or evaluating this mapping seems insurmountable if tack-
led directly. Deep learning resolves this difficulty by breaking the desired complicated
mapping into a series of nested simple mappings, each described by a different layer of
the model. The input is presented at the visible layer, so named because it contains the
variables that we are able to observe. Then a series of hidden layers extracts increasingly
abstract features from the image. These layers are called “hidden” because their values
are not given in the data; instead the model must determine which concepts are useful
for explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer
can easily identify edges, by comparing the brightness of neighboring pixels. Given the
first hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the third
hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image. Images reproduced
with permission from Zeiler and Fergus (2014).
[Figure 1.3 diagram omitted. Nested regions, from outermost to innermost: AI (example: knowledge bases), machine learning (example: logistic regression), representation learning (example: autoencoders), deep learning (example: MLPs).]
Figure 1.3: A Venn diagram showing how deep learning is a kind of representation
learning, which is in turn a kind of machine learning, which is used for many but not
all approaches to AI. Each section of the Venn diagram includes an example of an AI
technology.
improve with experience and data. According to the authors of this book, machine
learning is the only viable approach to building AI systems that can operate in compli-
cated, real-world environments. Deep learning is a particular kind of machine learning
that achieves great power and flexibility by learning to represent the world as a nested
hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.3
illustrates the relationship between these different AI disciplines. Fig. 1.4 gives a high-
level schematic of how each works.
1.1 Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target
audiences in mind. One of these target audiences is university students (undergraduate
or graduate) learning about machine learning, including those who are beginning a
[Figure 1.4 flowcharts omitted. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → features → mapping from features → output. Deep learning: input → simplest features → additional layers of increasingly complex features → mapping from features → output.]
Figure 1.4: Flow-charts showing how the different parts of an AI system relate to each
other within different AI disciplines. Shaded boxes indicate components that are able
to learn from data.
career in deep learning and artificial intelligence research. The other target audience
is software engineers who do not have a machine learning or statistics background, but
want to rapidly acquire one and begin using deep learning in their product or platform.
Software engineers working in a wide variety of industries are likely to find deep learning
to be useful, as it has already proven successful in many areas including computer vision,
speech and audio processing, natural language processing, robotics, bioinformatics and
chemistry, video games, search engines, online advertising, and finance.
This book has been organized into three parts in order to best accommodate a variety
of readers. Part 1 introduces basic mathematical tools and machine learning concepts.
Part 2 describes the most established deep learning algorithms that are essentially solved
technologies. Part 3 describes more speculative ideas that are widely believed to be
important for future research in deep learning.
Readers should feel free to skip parts that are not relevant given their interests or
background. Readers familiar with linear algebra, probability, and fundamental machine
learning concepts can skip part 1, for example, while readers who just want to implement
a working system need not read beyond part 2.
We do assume that all readers come from a computer science background. We assume
familiarity with programming, a basic understanding of computational performance is-
sues, complexity theory, introductory level calculus, and some of the terminology of
graph theory.
1.2 Historical Trends in Deep Learning
While the term “deep learning” is relatively new, the field dates back to the 1950s. The
field has been rebranded many times, reflecting the influence of different researchers and
different perspectives. Previous names include “artificial neural networks,” “parallel
distributed processing,” and “connectionism.”
One important perspective in the history of deep learning is the idea that artificial
intelligence should draw inspiration from the brain (whether the human brain or the
brains of animals). This perspective gave rise to the “neural network” terminology.
Unfortunately, we know extremely little about the brain. The brain contains billions
of neurons with tens of thousands of connections between neurons. We are not yet
able to accurately record the individual activities of more than a handful of neurons
simultaneously. Consequently, we do not have the right kind of data to reverse engineer
the algorithms used by the brain. Deep learning algorithms resemble the brain insofar
as both the brain and deep learning models involve a very large number of computation
units that are not especially intelligent in isolation but become intelligent when they
interact with each other. Beyond that, it is difficult to say how similar the two are;
they are unlikely to have many other similarities, and our knowledge of the brain does
not give very specific guidance for improving deep learning. For these reasons, modern
terminology no longer emphasizes the biological inspiration of deep learning algorithms.
Deep learning has now drawn useful insights from many fields other than
neuroscience, including structured probabilistic models and manifold learning, and the
modern terminology aims to avoid implying that only one field has inspired the current
algorithms.
One may wonder why deep learning has only recently become recognized as a crucial
technology if it has existed since the 1950s. Deep learning has been successfully used
in commercial applications since the 1990s, but until recently it was often regarded as
being more of an art than a technology, something that only an expert could use. It is
true that some skill is required to get good performance from a deep learning algorithm.
Fortunately, the amount of skill required decreases as the amount of training data and the
size of the model increase. In the age of “Big Data” we now have large enough training
sets to make deep learning algorithms consistently perform well (see Fig. 1.5), and fast
enough CPUs or GPUs and enough memory to train very large models (see Fig. 1.6 and
Fig. 1.7). The algorithms reaching human performance on complex tasks today are very
similar to the algorithms that struggled to solve toy problems in the 1980s—the most
important difference is that today we can provide these algorithms with the resources
they need to succeed.
The earliest predecessors of modern deep learning were simple linear models moti-
vated from a neuroscientific perspective. These models took a vector of n input values x
and computed a simple function f(x) = \sum_{i=1}^{n} w_i x_i using a vector of learned “weights”
w. The Perceptron (Rosenblatt, 1958, 1962) could recognize two different categories of
inputs by testing whether f(x) is positive or negative. The Adaptive Linear Element
(ADALINE) simply returned the value of f(x) itself to predict a real number (Widrow
and Hoff, 1960).
These simple learning algorithms greatly affected the modern landscape of machine
learning. ADALINE can be seen as training a linear regression model with the stochastic
gradient descent algorithm, which is still used, with only slight modification, in state-of-
the-art deep learning algorithms today. Linear regression itself remains in use in cases
where we prefer the speed or interpretability of the model over the ability to fit complex
training sets, or where we have too little training data, or too noisy a relationship between
inputs and outputs, to fit a more complicated model.
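A minimal sketch of these early linear models on synthetic data: the learned function is f(x) = \sum_{i=1}^{n} w_i x_i, ADALINE-style training adjusts w by stochastic gradient descent on the squared error, and a perceptron-style classifier simply thresholds the same linear function. The data and hyperparameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: real-valued targets generated by a known linear function.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(2)
lr = 0.01

# ADALINE-style training: stochastic gradient descent on the squared error,
# visiting one example at a time.
for epoch in range(20):
    for i in rng.permutation(len(X)):
        f = X[i] @ w                    # f(x) = sum_i w_i x_i
        w -= lr * (f - y[i]) * X[i]     # SGD step on (f - y)^2 / 2

print(w)  # approximately [2.0, -1.0]

# A perceptron-style classifier recognizes two categories by taking the sign
# of the same linear function.
labels = np.sign(X @ w)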
Unfortunately, the limitations of these linear models led to a backlash against bio-
logically inspired machine learning in general (Minsky and Papert, 1969), and other ap-
proaches dominated AI until the early 1980s. In the mid-1980s, the back-propagation
algorithm enabled the extension of biologically-inspired machine learning approaches
to more complex models that incorporated non-linear behavior via the introduction of
hidden layers (Rumelhart et al., 1986a; LeCun, 1987). Neural networks became popular
again and remained so until the mid-1990s, when their popularity declined once more.
This was in part due to a negative reaction to the failure of neural networks
to fulfill excessive promises made by a variety of people seeking investment in neural
network-based ventures, but also due to improvements in other fields of machine learn-
ing that were more amenable to theoretical analysis. Kernel machines (Boser et al.,
1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan,
1998) became the main focus of academic study, while hand-designing domain-specific
features became the typical approach to practical applications. During this time, neu-
ral networks continued to obtain impressive performance on some tasks (LeCun et al.,
[Figure 1.5 plot omitted. Axes: year (logarithmic scale) versus dataset size in number of examples (logarithmic scale). Datasets plotted: Criminals, Rotated T vs C, T vs G vs F, Iris, MNIST, Public SVHN, CIFAR-10, ImageNet, ImageNet10k, ILSVRC 2014, and Sports-1M.]
Figure 1.5: Dataset sizes have increased greatly over time. In the early 1900s, statis-
ticians studied datasets using hundreds or thousands of manually compiled measure-
ments (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through
1980s, the pioneers of biologically-inspired machine learning often worked with small,
synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur
low computational cost and demonstrate that neural networks were able to learn specific
kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and
1990s, machine learning became more statistical in nature and began to leverage larger
datasets containing tens of thousands of examples such as the MNIST dataset of scans
of handwritten numbers (LeCun et al., 1998a). In the first decade of the 2000s, more
sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and
Hinton, 2009) continued to be produced. Toward the end of that decade and throughout
the first half of the 2010s, significantly larger datasets, containing hundreds of thousands
to tens of millions of examples, completely changed what was possible with deep learn-
ing. These datasets included the public Street View House Numbers dataset(Netzer
et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010; Rus-
sakovsky et al., 2014), and the Sports-1M dataset (Karpathy et al., 2014). Deep learning
methods so far require large, labeled datasets to succeed. As of 2015, a rough rule of
thumb is that a supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or exceed
human performance when trained with a dataset containing at least 10 million labeled
examples.
[Figure 1.6 plot omitted. Axes: year (logarithmic scale) versus connections per neuron (logarithmic scale). Horizontal reference lines mark the fruit fly, mouse, cat, and human; numbered points 1–10 correspond to the networks listed after the caption.]
Figure 1.6: Initially, the number of connections between neurons in artificial neural net-
works was limited by hardware capabilities. Today, the number of connections between
neurons is mostly a design consideration. Some artificial neural networks have nearly as
many connections per neuron as a cat, and it is quite common for other neural networks
to have as many connections per neuron as smaller mammals like mice. Even the hu-
man brain does not have an exorbitant number of connections per neuron. The sparse
connectivity of biological neural networks means that our artificial networks are able to
match the performance of biological neural networks despite limited hardware. Modern
neural networks are much smaller than the brains of any vertebrate animal, but we typ-
ically train each network to perform just one task, while an animal’s brain has different
areas devoted to different tasks. Biological neural network sizes from Wikipedia (2015).
1. Adaptive Linear Element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009a)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014)
1998a; Bengio et al., 2001a).
Research groups led by Geoffrey Hinton at the University of Toronto, Yoshua Bengio
at the University of Montreal, and Yann LeCun at New York University re-popularized
neural networks, re-branded as “deep learning,” beginning in 2006. At this time, it was
believed that the primary difficulty in using deep learning was optimizing the non-convex
functions involved in neural network training. Until approximately 2012, most research
in deep learning focused on using unsupervised learning to “pretrain” each layer of the
network in isolation, so that the final supervised training stage would not need to modify
the network’s parameters greatly (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al.,
2007).
Since 2012, the greatest successes in deep learning have come not from this layerwise
pretraining scheme but simply from applying traditional supervised learning techniques
to large models on large datasets. Deep architectures had been proposed well before this
point (Fukushima, 1980; LeCun et al., 1989; Schmidhuber, 1992; Utgoff and Stracuzzi, 2002),
but attempts to jointly train a deep neural network with many layers had not met with major
success, except to some extent in the case of convolutional architectures (LeCun et al., 1989,
1998c).
[Figure 1.7 plot omitted. Axes: year (logarithmic scale) versus number of neurons (logarithmic scale). Horizontal reference lines mark the sponge, roundworm, leech, ant, bee, frog, octopus, and human; numbered points 1–20 correspond to the networks listed after the caption.]
Figure 1.7: Since the introduction of hidden units, artificial neural networks have dou-
bled in size roughly every 2.4 years. This growth is driven primarily by faster computers
with larger memory, but also by the availability of larger datasets. These larger net-
works are able to achieve higher accuracy on more complex tasks. This trend looks
set to continue for many years. Unless new technologies allow faster scaling, artificial
neural networks will not have the same number of neurons as the human brain until
2056. Real biological neurons are likely to represent more complicated functions than
current artificial neurons, so biological neural networks may be even larger than this
plot portrays. Biological neural network sizes from Wikipedia (2015).
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive Linear Element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early backpropagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009a)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014)