Chapter 1
Deep Learning for AI
Inventors have long dreamed of creating machines that think. Ancient Greek myths tell
of intelligent objects, such as animated statues of human beings and tables that arrive
full of food and drink when called. When programmable computers were first conceived,
people wondered whether they might become intelligent, over a hundred years before
one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with
many practical applications and active research topics. We look to intelligent software
to automate routine labor, understand speech or images, make diagnoses in medicine,
and support basic scientific research. This book is about deep learning, an approach
to AI based on enabling computers to learn from experience and understand the world
in terms of a hierarchy of concepts, with each concept defined in terms of its relation to
simpler concepts.
Many of the early successes of AI took place in relatively sterile and formal en-
vironments and did not require computers to have much knowledge about the world.
For example, IBM’s Deep Blue chess-playing system defeated world champion Garry
Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only
sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed
ways. Devising a successful chess strategy is a tremendous intellectual accomplishment,
but does not require much knowledge about the agent’s environment. The environment
can be described by a very brief list of rules, easily provided ahead of time by the
programmer.
Ironically, abstract and formal tasks such as chess that are among the most difficult
mental undertakings for a human being are among the easiest for a computer. A person’s
everyday life requires an immense amount of knowledge about the world, and much
of this knowledge is subjective and intuitive, and therefore difficult to articulate in a
formal way. Yet, computers require some form of knowledge in order to make intelligent
decisions. Where is that knowledge going to come from?
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. None of these projects has led
to a major success. One of the most famous such projects is Cyc. Cyc is an inference
engine and a database of statements in a language called CycL. These statements are
entered by a staff of human supervisors. It is an unwieldy process. People struggle
to devise formal rules with enough complexity to accurately describe the world. For
example, Cyc failed to understand a story about a person named Fred shaving in the
morning (Linde, 1992). Its inference engine detected an inconsistency in the story—it
knew that people do not have electrical parts, but because Fred was holding an electric
razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore
asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning. The introduction of machine
learning allowed computers to tackle problems involving knowledge of the real world
and make decisions that appear subjective. A simple machine learning algorithm called
logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef
et al., 1990). A simple machine learning algorithm called naive Bayes can separate
legitimate e-mail from spam e-mail. What we call a learning machine, or more generally
a learner, is the agent that executes the learning procedure: it takes training data as
input and yields a change in the agent (or, mathematically, a function).
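As an illustration, the sketch below fits a logistic regression classifier over hand-designed binary features. The feature names and data are invented toy values (not from the Mor-Yosef et al. study), and scikit-learn is assumed as the library:

```python
# A minimal sketch (invented toy data): logistic regression over
# hand-designed binary features, as in the cesarean-delivery example.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one patient; columns are doctor-reported features,
# e.g. [uterine_scar, breech_presentation, prior_cesarean] (hypothetical).
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0],
              [1, 1, 1]])
y = np.array([1, 1, 0, 1])  # 1 = cesarean recommended (toy labels)

model = LogisticRegression().fit(X, y)
print(model.predict([[1, 0, 0]]))        # predicted recommendation
print(model.predict_proba([[1, 0, 0]]))  # class probabilities
```

Note that the model sees only the features the doctor chose to report; change the features and the learner's view of the patient changes with them.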
The performance of these simple machine learning algorithms depends heavily on
the representation of the data they are given. For example, when logistic regression
is used to recommend cesarean delivery, the AI system does not examine the patient
directly. Instead, the doctor tells the system several pieces of relevant information, such
as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each
of these features of the patient correlates with various outcomes. However, it cannot
learn what features are useful, nor can it observe the features itself. If logistic regression
were given a 3-D MRI image of the patient, rather than the doctor's formalized report,
it would not be able to make useful predictions. Individual voxels[1] in an MRI scan have
negligible correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears through-
out computer science and even daily life. In computer science, operations such as search-
ing a collection of data can proceed exponentially faster if the collection is structured
and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but
find arithmetic on Roman numerals much more time consuming. It is not surprising
that the choice of representation has an enormous effect on the performance of machine
learning algorithms.
[1] A voxel is the value at a single point in a 3-D scan, much as a pixel is the value at a single point in an image.
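For instance, a minimal sketch in Python of the same membership query over unstructured and structured (sorted) data:

```python
# A minimal sketch: the same query is O(n) on raw data but
# O(log n) once the data is structured (sorted) for the task.
import bisect

data = [17, 3, 42, 8, 25, 1]

def linear_search(xs, target):
    # Unstructured representation: inspect every element.
    return any(x == target for x in xs)

sorted_data = sorted(data)  # invest once in a better representation

def binary_search(xs, target):
    # Structured representation: halve the search space each step.
    i = bisect.bisect_left(xs, target)
    return i < len(xs) and xs[i] == target

assert linear_search(data, 25) and binary_search(sorted_data, 25)
```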
Example of representation
Data can be represented in different ways, but some representations make it
easier for machine learning algorithms to capture the knowledge they provide.
For example, a number can be represented by its binary encoding (with n bits),
by a single real-valued scalar, or by its one-hot encoding (with 2^n bits of which
only one is turned on). In many cases, the compact binary representation is
a poor choice for learning algorithms, because two very nearby values (like 3,
encoded as binary 00000011, and 4, encoded as binary 00000100) have no digits
in common while two values that are very different (like binary 10000001 = 129
and binary 00000001 = 1) only differ by one digit. This makes it difficult for
the learning machine to generalize from examples to numerically close ones.
However, in many applications we expect that what is true for input x is often
true for input x + ε for a small ε. This is called the smoothness prior and is ex-
ploited in most applications of machine learning that involve real numbers, and
to some extent other data types in which some meaningful notion of similarity
can be defined.
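A minimal sketch in plain Python makes the bit-overlap contrast concrete (the 8-bit width and 256-way one-hot size are arbitrary choices for the example):

```python
# A minimal sketch: nearby integers can share no bits in binary,
# while one-hot codes make every pair of distinct values equidistant.
def binary_bits(n, width=8):
    return [int(b) for b in format(n, f"0{width}b")]

def one_hot(n, size=256):
    v = [0] * size
    v[n] = 1
    return v

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 3 = 00000011 vs 4 = 00000100: numerically adjacent, 3 differing bits.
print(hamming(binary_bits(3), binary_bits(4)))    # 3
# 129 = 10000001 vs 1 = 00000001: numerically far, 1 differing bit.
print(hamming(binary_bits(129), binary_bits(1)))  # 1
# One-hot: any two distinct values differ in exactly 2 positions.
print(hamming(one_hot(3), one_hot(4)))            # 2
```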
Many artificial intelligence tasks can be solved by designing the right set of features
to extract for that task, then providing these features to a simple machine learning
algorithm. For example, a useful feature for speaker identification from sound is the
pitch. The pitch can be formally specified—it is the lowest frequency major peak of the
spectrogram. It is useful for speaker identification because it is determined by the size
of the vocal tract, and therefore gives a strong clue as to whether the speaker is a man,
woman, or child.
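One rough way to compute such a hand-designed feature is sketched below with NumPy; the peak threshold and the toy test signal are invented for illustration, and real pitch trackers are considerably more careful:

```python
# A minimal sketch: estimate pitch as the lowest strong peak of the
# magnitude spectrum of a short audio frame.
import numpy as np

def estimate_pitch(frame, sample_rate, threshold=0.5):
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Lowest frequency whose magnitude is a sizable fraction of the
    # global maximum: the "lowest frequency major peak".
    strong = np.where(spectrum >= threshold * spectrum.max())[0]
    return freqs[strong[0]]

# Toy usage: a 200 Hz tone with a weaker harmonic at 400 Hz.
sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 200 * t) + 0.4 * np.sin(2 * np.pi * 400 * t)
print(estimate_pitch(frame, sr))  # close to 200 (within one FFT bin)
```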
However, for many tasks, it is difficult to know what features should be extracted. For
example, suppose that we would like to write a program to detect cars in photographs.
We know that cars have wheels, so we might like to use the presence of a wheel as a
feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms
of pixel values. A wheel has a simple geometric shape but its image may be complicated
by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the
fender of the car or an object in the foreground obscuring part of the wheel, and so on.
One solution to this problem is to use machine learning to discover not only the map-
ping from representation to output but also the representation itself. This approach is
known as representation learning. Learned representations often result in much better
performance than can be obtained with hand-designed representations. They also al-
low AI systems to rapidly adapt to new tasks, with minimal human intervention. A
representation learning algorithm can discover a good set of features for a simple task
in minutes, or a complex task in hours to months. Manually designing features for a
complex task requires a great deal of human time and effort; it can take decades for an
entire community of researchers.
When designing features or algorithms for learning features, our goal is usually to
separate the factors of variation that explain the observed data. In this context, we use
the word “factors” simply to refer to separate sources of influence; the factors are usually
not combined by multiplication. Such factors are often not quantities that are directly
observed but they exist in the minds of humans as explanations or inferred causes of
the observed data. They can be thought of as concepts or abstractions that help us
make sense of the rich variability in the data. When analyzing a speech recording, the
factors of variation include the speaker’s age and sex, their accent, and the words that
they are speaking. When analyzing an image of a car, the factors of variation include
the position of the car, its color, and the angle and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is
that many of the factors of variation influence every single piece of data we are able to
observe. The individual pixels in an image of a red car might be very close to black at
night. The shape of the car’s silhouette depends on the viewing angle. Most applications
require us to disentangle the factors of variation and discard the ones that we do not
care about.
Of course, it can be very difficult to extract such high-level, abstract features from
raw data. Many of these factors of variation, such as a speaker’s accent, also require
sophisticated, nearly human-level understanding of the data. When it is nearly as diffi-
cult to obtain a representation as to solve the original problem, representation learning
does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.1
shows how a deep learning system can represent the concept of an image of a person by
combining simpler concepts, such as corners and contours, which are in turn defined in
terms of edges.
Another perspective on deep learning is that it allows the computer to learn a multi-
step computer program. Each layer of the representation can be thought of as the
state of the computer’s memory after executing another set of instructions in parallel.
Networks with greater depth can execute more instructions in sequence. Being able
to execute instructions sequentially offers great power because later instructions can
refer back to the results of earlier instructions. According to this view of deep learning,
not all of the information in a layer’s representation of the input necessarily encodes
factors of variation that explain the input. The representation is also used to store state
information that helps to execute a program that can make sense of the input.
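To make this view concrete, here is a minimal sketch (NumPy, with random untrained weights and invented layer sizes) of a forward pass in which each layer computes a new representation of, and from, the one before it:

```python
# A minimal sketch: each layer re-represents the previous layer's
# output, like sequential steps of a program executed in parallel.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 64, 10]  # pixels -> (loosely) edges -> parts -> classes
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W in weights:
        h = np.maximum(0.0, h @ W)  # one rectified linear layer
    return h

x = rng.random(784)      # the "visible layer": raw pixel values
print(forward(x).shape)  # (10,): the deepest representation
```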
“Depth” is not a mathematically rigorous term in this context; there is no formal
definition of deep learning. All approaches to deep learning share the idea of nested
representations of data, but different approaches view depth in different ways. For some
approaches, the depth of the system is the depth of the flowchart describing the com-
putations needed to produce the final representation. The depth corresponds roughly
to the number of times we update the representation. Other approaches consider depth
to be the depth of the graph describing how concepts are related to each other. In this
case, the depth of the flowchart of the computations needed to compute the represen-
tation of each concept may be much deeper than the graph of the concepts themselves.
[Figure 1.1 appears here; the image itself is not reproduced in this extraction. Its panels are convolutional network feature visualizations taken from Zeiler and Fergus (2014), stacked by layer: visible layer (input pixels), 1st hidden layer (edges), 2nd hidden layer (corners and contours), 3rd hidden layer (object parts), and output (object identity: CAR, PERSON, ANIMAL).]
Figure 1.1: Illustration of a deep learning model. It is difficult for a computer to un-
derstand the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity
is very complicated. Learning or evaluating this mapping seems insurmountable if tack-
led directly. Deep learning resolves this difficulty by breaking the desired complicated
mapping into a series of nested simple mappings, each described by a different layer of
the model. The input is presented at the visible layer. Then a series of hidden layers ex-
tracts increasingly abstract features from the image. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer
can easily identify edges, by comparing the brightness of neighboring pixels. Given the
first hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the third
hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image. Images provided
by Zeiler and Fergus (2014).
[Figure 1.2 appears here; the image itself is not reproduced in this extraction. It is a Venn diagram of nested regions: AI (example: knowledge bases) contains machine learning (example: logistic regression), which contains representation learning (example: autoencoders), which contains deep learning (example: MLPs).]
Figure 1.2: A Venn diagram showing how deep learning is a kind of representation
learning, which is in turn a kind of machine learning, which is used for many but not
all approaches to AI. Each section of the Venn diagram includes an example of an AI
technology.
This is because the system’s understanding of the simpler concepts can be refined given
information about the more complex concepts. For example, an AI system observing an
image of a face with one eye in shadow may initially only see one eye. After detecting
that a face is present, it can then infer that a second eye is probably present as well.
To summarize, deep learning, the subject of this book, is an approach to AI. Specif-
ically, it is a type of machine learning, a technique that allows computer systems to
improve with experience and data. According to the authors of this book, machine
learning is the only viable approach to building AI systems that can operate in compli-
cated, real-world environments. Deep learning is a particular kind of machine learning
that achieves great power and flexibility by learning to represent the world as a nested
hierarchy of concepts, with each concept defined in relation to simpler concepts. Fig. 1.2
illustrates the relationship between these different AI disciplines. Fig. 1.3 gives a high-
level schematic of how each works.
Figure 1.3: Flow-charts showing how the different parts of an AI system relate to each
other within different AI disciplines. Shaded boxes indicate components that are able
to learn from data.
Deep learning is the subject of this book. It involves learning multiple levels of
representation, corresponding to different levels of abstraction. In the past five years,
research on deep learning has had a tremendous impact, both at an academic level and
in terms of industrial breakthroughs.
Representation learning algorithms can either be supervised, unsupervised, or a
combination of both (semi-supervised). These notions are explained in more detail in
Chapter 5, but we introduce them briefly here. Supervised learning requires examples
that include both an input and a target output, the latter being generally interpreted
as what we would have liked the learner to produce as output, given that input. Such
examples are called labeled examples because the target output often comes from a hu-
man providing that “right answer”. Manual labeling can be tedious and expensive, and
unlabeled data is far more plentiful than labeled data. Unsupervised learning
allows a learner to capture statistical dependencies present in unlabeled data, while
semi-supervised learning combines labeled examples and unlabeled examples. The fact
that several deep learning algorithms can take advantage of unlabeled examples can be
an important advantage, discussed at length in this book, in particular in Chapter 10 as
well as in Section 1.5 of this introductory chapter.
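As a concrete sketch of the distinction, scikit-learn's semi-supervised API marks unlabeled examples with the label -1; the toy data below is invented for illustration:

```python
# A minimal sketch: semi-supervised learning with mostly unlabeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 20))
true_y = (X[:, 0] > 0.5).astype(int)
labeled = rng.random(1000) < 0.05   # only ~5% of examples carry a label
y = np.where(labeled, true_y, -1)   # -1 marks "unlabeled"

clf = SelfTrainingClassifier(LogisticRegression())
clf.fit(X, y)             # uses labeled and unlabeled examples together
print(clf.predict(X[:5]))
```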
Deep learning has not only changed the field of machine learning and influenced our
understanding of human perception; it has also revolutionized areas of application such as
speech recognition and image understanding. Companies such as Google, Microsoft,
Facebook, IBM, NEC, Baidu and others have all deployed products and services based
on deep learning methods, and set up research groups to take advantage of deep learn-
ing. Deep learning was used to win international competitions in object recognition.[2]
Deep unsupervised learning (trying to capture the input distribution, and relying
less on having many labeled examples) performs exceptionally well in a transfer con-
text, when the model has to be applied to a test distribution that is a bit different
from the training distribution,[3] e.g., involving new categories. With large training sets,
deep supervised learning has been the most impressive, as recently shown with the out-
standing breakthrough achieved by Geoff Hinton’s team on the ImageNet object recogni-
tion 1000-class benchmark, bringing down the state-of-the-art error rate from 26.1% to
15.3% (Krizhevsky et al., 2012a). Since then, these competitions have consistently been won by
deep convolutional nets and, as of this writing, advances in deep learning have brought
this even further down to 6.5%, using even deeper networks (Szegedy et al., 2014). On
another front, whereas speech recognition error rates kept decreasing in the 90’s (thanks
mostly to better systems engineering, larger datasets, and larger HMM models), perfor-
mance of speech recognition systems had stagnated in the 2000-2010 decade, until the
[2] Jürgen Schmidhuber's lab at IDSIA has won many such competitions; see http://www.idsia.ch/~juergen/deeplearning.html. Hinton's U. Toronto group and Fergus' NYU group respectively won the ImageNet competitions in 2012 and 2013 (Krizhevsky et al., 2012a); see http://www.image-net.org/challenges/LSVRC/2012/ and http://www.image-net.org/challenges/LSVRC/2013/.
We have used deep learning to win an international competition in computer vision focused on the
detection of emotional expression from videos and audio (Ebrahimi et al., 2013).
[3] Teams from Yoshua Bengio's lab won the Transfer Learning Challenge (results at ICML 2011 work-
shop) (Mesnil et al., 2011) and the NIPS’2011 workshop Transfer Learning Challenge (Goodfellow et al.,
2011).
advent of deep learning (Hinton et al., 2012a). Since then, thanks to large and deep
architectures, the error rates on the well-known Switchboard benchmark have dropped by
about half! As a consequence, most of the major speech recognition systems (Microsoft,
Google, IBM, Apple) have incorporated deep learning, which has become a de facto
standard at conferences such as ICASSP.
1.1 Who should read this book?
This book can be useful for a variety of readers, but the main target audiences are
university students (undergraduate or graduate) learning about machine learning, and
engineers and practitioners of machine learning, artificial intelligence, data mining
and data science who aim to better understand and take advantage of deep learning. Ma-
chine learning is successfully applied in many areas, including computer vision, natural
language processing, robotics, speech and audio processing, bioinformatics, video-games,
search engines, online advertising and many more. Deep learning has been most suc-
cessful in traditional AI applications but is expanding into other areas, such as modeling
molecules, customers or web pages. In addition, deep learning is moving out of its early
territory of pattern recognition (in speech and images) and into natural language pro-
cessing and tasks with complex outputs, e.g. with ongoing breakthroughs in machine
translation (Hermann and Blunsom, 2014; Devlin et al., 2014; Sutskever et al., 2014;
Bahdanau et al., 2014).
Knowledge of basic concepts in machine learning will be very helpful for absorbing the
concepts in this book, although the book will attempt to explain these concepts intu-
itively (and sometimes formally) when needed. Similarly, knowledge of basic concepts
in probability, statistics, calculus, linear algebra, and optimization will be very useful,
although the book will briefly explain the required concepts as needed, in particular in
Chapters 2, 3 and 4. Knowledge of computer science and familiarity with programming
will be mostly useful for understanding and modifying the code provided in the practical ex-
ercises associated with this book, in the Python language and based on the Pylearn2
machine learning and deep learning library, which is dedicated to rapidly prototyping
new algorithms and sharing research results.
Since much science remains to be done in deep learning, many practical aspects
of these algorithms can be seen as tricks, practices that have been found to work (at
least in some contexts) while we do not have complete explanatory theories for them.
This book will also spell out these practical guidelines, although the reader is invited to
question them and even to figure out the reasons for their successes and failures. A live
online resource http://www.deeplearning.net/book/guidelines allows practitioners
and researchers to share their questions and experience and keep abreast of developments
in the art of deep learning. Keep in mind that science is not frozen but evolving, because
we dare to question established wisdom, and readers are invited to contribute to this
exciting expansion and clarification of our knowledge.
1.2 Machine Learning
This section introduces a few machine learning concepts, while a deeper treatment of
this branch of knowledge at the intersection of artificial intelligence and statistics can be
found in Chapter 5. A machine learning algorithm (or learner) sees training examples,
each of which can be thought of as an assignment of values to some variables (which we
call random variables[4]). A learner uses these examples to build a function or a model
from which it can answer questions about these random variables or perform some useful
computation on their values. By analogy, human brains (whose learning algorithm we
do not perfectly understand but would like to decipher) see as training examples the
sequence of experiences of their life, and the observed random variables are what arrives
at their senses as well as the internal reinforcement signals (such as pain or pleasure)
that evolution has programmed in us to guide our learning process towards survival and
reproduction. Human brains also observe their own actions, which influence the world
around them, and it appears that human brains try to learn the statistical dependencies
between these actions and their consequences, so as to maximize future rewards.
What we call a configuration of variables is an assignment of values to the vari-
ables. For example, if we have 10 binary variables then there are 2^10 possible config-
urations of their values. The crux of what a learner needs to do is to guess which
configurations of the variables of interest[5] are most likely to occur again. And the fun-
damental challenge involved is the so-called curse of dimensionality: the number of
possible configurations of these variables can be astronomical, since it grows exponentially
with the number of variables involved (which could be, for example, the thousands of
pixels in an image). Indeed, each observed example only tells the learner
about one such configuration. To make a prediction on configurations never seen before,
the learner can only make a guess. How could humans, animals or machines possibly
have a preference for some of these unseen configurations, after seeing some positive
examples, a very small fraction of all the possible configurations? If the only informa-
tion available to make this guess comes from the examples, then not much could be said
about new unseen configurations except that they should be less likely than the observed
ones. However, every machine learning algorithm incorporates not just the information
from examples but also some priors, which can be understood as knowledge about the
world that has been built up from previous experience, either personal experience of the
learner, or from the prior experience of other learners, e.g., through biological or cultural
evolution, in the case of humans. These priors can be combined with the given training
examples to form better predictions. Bayesian machine learning attempts to formalize
these priors as probability distributions; once this is done, Bayes' theorem and the
laws of probability (discussed in Chapter 3) dictate what the right predictions should
[4] Informally, a random variable is a variable which can take different values, with some uncertainty
about which value will be taken, i.e., its value is not perfectly predictable.
[5] At the lowest level, the variables of interest would be the externally observed variables (the sensor
readings both from the outside world and from the body), actions and rewards, but we will see in this
book that it can be advantageous to model the statistical structure by introducing internal variables or
internal representations of the observed variables.
be. However, the required calculations are generally intractable and it is not always clear
how to formalize all our priors in this way. Many of the priors associated with different
learning algorithms are implicit in the computations performed, which can sometimes
be viewed as approximations aimed at reconciling the information in the priors and the
information in the data, in a way that is computationally tractable. Indeed, machine
learning needs to deal not just with priors but also with computation. A practical learn-
ing algorithm must not just make good predictions but also do so quickly enough, with
reasonable computing resources spent during training and at the time of making and
acting on decisions.
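A quick back-of-the-envelope sketch of this explosion in plain Python:

```python
# A minimal sketch: the number of configurations grows exponentially,
# so observed examples can only cover a vanishing fraction of them.
from itertools import product

n = 10
configs = list(product([0, 1], repeat=n))
print(len(configs))             # 2**10 = 1024 configurations

n_pixels = 1000                 # a small binary image
print(len(str(2 ** n_pixels)))  # ~302 digits: astronomically many

print(1_000_000 / 2 ** 30)      # a million examples barely dent
                                # even 30 binary variables
```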
The best studied machine learning task is the problem of supervised classifica-
tion. The examples are configurations of input variables along with a target category.
The learner’s objective is typically to classify new input configurations, i.e., to predict
which of the categories is most likely to be associated with the given input values. A
central concept in machine learning is generalization: how good are these guesses
made on new configurations of the observed variables? How many classification errors
would a learner make on new examples? A learner that generalizes well on a given dis-
tribution makes good guesses about new configurations, after having seen some training
examples. This concept is related to another basic concept: capacity. Capacity is a
measure of the flexibility of the learner, essentially the number of training examples
that it could always learn perfectly. Machine learning theory has traditionally focused
on the relationship between generalization and capacity. Overfitting occurs when ca-
pacity is too large compared to the number of examples, so that the learner does a
good job on the training examples (it correctly guesses that they are likely configu-
rations) but a very poor one on new examples (it does not discriminate well between
the likely configurations and the unlikely ones). Underfitting occurs when, instead, the
learner does not have enough capacity, so that even on the training examples it is not
able to make good guesses: it does not manage to capture enough of the information
present in the training examples, maybe because it does not have enough degrees of
freedom to fit all the training examples. Whereas theoretical analysis of overfitting is
often of a statistical nature (how many examples do I need to get good generaliza-
tion with a learner of a given capacity?), underfitting has been less studied because it
often involves the computational aspect of machine learning. The main reason we get
underfitting (especially with deep learning) is not that we choose to have insufficient
capacity but that obtaining high capacity in a learner that has strong priors often
involves difficult numerical optimization. Numerical optimization methods attempt
to find a configuration of some variables (often called parameters, in machine learning)
that minimizes or maximizes some given function of these parameters, which we call an
objective function or training criterion. During training, one typically iteratively
modifies the parameters so as to gradually minimize the training criterion (for exam-
ple, the classification error). At each step of this adaptive process, the learner slightly
changes its parameters so as to make better guesses on the training examples. This of
course does not in general guarantee that future guesses on novel test examples will
be good, i.e., we could be in an overfitting situation. In the case of most deep learning
algorithms, this difficulty in optimizing the training criterion is related to the fact that
it is typically not convex in the parameters of the model. It is even the case that for
many models (such as most neural network models), obtaining the optimal parameters
can be computationally intractable (in general requiring computation that grows ex-
ponentially with the number of parameters). It means that approximate optimization
methods (which are often iterative) must be used, and such methods often get stuck in
what appear to be local minima of the training criterion,[6] whereas one would hope
to find the best possible solution, i.e., a global minimum. We believe that the issue
of underfitting is central in deep learning algorithms and deserves a lot more attention
from researchers.
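As a minimal sketch of this adaptive process, gradient descent on an invented one-parameter convex criterion (real deep-learning criteria are non-convex, which is exactly where the local-minima difficulty below comes from):

```python
# A minimal sketch of numerical optimization: repeatedly nudge the
# parameter in the direction that decreases the training criterion.
def criterion(w):
    return (w - 3.0) ** 2        # toy objective function

def gradient(w):
    return 2.0 * (w - 3.0)       # its derivative

w = 0.0                          # initial parameter value
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)  # one small parameter update

print(w)  # very close to the global minimum at w = 3
```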
Another central concept in modern machine learning is probability theory. Be-
cause the data gives us information about random variables, probability theory pro-
vides a natural language to describe their uncertainty and many learning algorithms
(including most described in this book) are formalized as means to capture a proba-
bility distribution over the observed variables. The probability assigned by a learner
to a configuration quantifies how likely it is to encounter that configuration of vari-
ables. The classical means of training a probabilistic model involve the definition of the
model as a family of probability functions indexed by some parameters, and the use of
the maximum likelihood criterion[7] (or a variant that incorporates some priors) to
define towards what objective to optimize these parameters. Unfortunately, for many
probabilistic models of interest (that have enough capacity and expressive power), it
is computationally intractable to maximize the likelihood exactly, and even computing
its gradient[8] is generally intractable. This has led to a variety of practical learning
algorithms with different ways to bypass these obstacles, many of which are described
in this book.
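For a model family simple enough to be tractable, maximizing the likelihood is straightforward; a minimal sketch for a Bernoulli model on invented toy data:

```python
# A minimal sketch: maximum likelihood for a Bernoulli model.
# The family is p(x = 1) = theta; the criterion is the log-probability
# the model assigns to the whole training set.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # toy binary observations

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(best, data.mean())  # both ~0.75: the MLE equals the empirical mean
```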
Another machine learning concept that turns out to be important to understand
many deep learning algorithms is that of manifold learning. The manifold learning
hypothesis (Cayton, 2005; Narayanan and Mitter, 2010) states that probability is con-
centrated around regions called manifolds, i.e., that most configurations are unlikely and
that probable configurations are neighbors of other probable configurations. We define
the dimension of a manifold as the number of independent types of changes (e.g. or-
thogonal directions) by which one can move and stay among probable configurations.
This hypothesis of probability concentration seems to hold for most AI tasks of interest,
as can be verified by the fact that most configurations of input variables are unlikely
(pick pixel values randomly and you will almost never obtain a natural-looking image).
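This is easy to check empirically; a minimal sketch (NumPy, using the correlation between horizontally adjacent pixels as an invented stand-in for natural-image structure):

```python
# A minimal sketch: uniformly random pixels lie far from the manifold
# of natural images -- they lack the local structure photographs have.
import numpy as np

rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=(64, 64)).astype(float)

# Natural images have strongly correlated neighboring pixels;
# random configurations have essentially none.
corr = np.corrcoef(noise[:, :-1].ravel(), noise[:, 1:].ravel())[0, 1]
print(round(corr, 3))  # ~0.0 for noise; typically 0.9+ for photographs
```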
The manifold hypothesis also states that small changes (e.g. translating an input image)
tend to leave unchanged categorical variables (e.g., object identity) and that there are
far fewer such local degrees of freedom (manifold dimensions) than the overall input
dimension (the number of observed variables). The associated natural clustering hy-
pothesis assumes that different classes correspond to different manifolds, well separated
by vast zones of low probability. These ideas turn out to be very important for under-
standing the basic concept of representation associated with deep learning algorithms,
which may be understood as a way of specifying a coordinate system along these manifolds,
as well as telling to which manifold an example belongs. Additionally, these manifold
learning ideas turn out to be important for understanding the mechanisms by
which regularized auto-encoders capture both the unknown manifold structure of the
data (Chapter 13) and the underlying data generating distribution (Section 17.9).
[6] A local minimum is a configuration of the parameters that cannot be improved by small changes
in the parameters, so if the optimization procedure is iterative and operates by small changes, training
appears stuck and unable to progress to a globally optimal solution.
[7] The maximum likelihood criterion is simply the probability that the model assigns to the whole training set.
[8] The gradient is the direction in which parameters should be changed in order to slightly improve the training criterion.
1.3 Historical Perspective and Neural Networks
Modern deep learning research takes a lot of its inspiration from neural network research
of previous decades. Other major intellectual sources of concepts found in deep learning
research include works on probabilistic modeling and graphical models, as well as works
on manifold learning.
The starting point of the story, though, is the Perceptron and the Adaline, inspired
by knowledge of the biological neuron: simple learning algorithms for artificial neu-
ral networks were introduced around 1960 (Rosenblatt, 1958; Widrow and Hoff, 1960;
Rosenblatt, 1962), leading to much research and excitement. However, after the initial
enthusiasm, research progress reached a plateau due to the inability of these simple
learning algorithms to learn representations (i.e., in the case of an artificial neural net-
work, to learn what the intermediate layers of artificial neurons - called hidden layers -
should represent). This limitation to learning linear functions of fixed features led to a
strong reaction (Minsky and Papert, 1969) and the dominance of symbolic computation
and expert systems as the main approaches to AI in the late 60’s, 70’s and early 80’s.
In the mid-1980’s, a revival of neural network research took place thanks to the back-
propagation algorithm for learning one or more layers of non-linear features (Rumelhart
et al., 1986a; LeCun, 1987). This second bloom of neural network research took place
in the decade up to the mid-90’s, at which point many overly strong claims were made
about neural nets, mostly with the objective of attracting investors and funding. In the
90’s and 2000’s, other approaches to machine learning dominated the field, especially
those based on kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf
et al., 1999) and those based on probabilistic approaches, also known as graphical mod-
els (Jordan, 1998). In practical applications, simple linear models with labor-intensive
design of hand-crafted features dominated the field. Kernel machines were found to per-
form as well or better while being more convenient to train (with fewer hyper-parameters,
i.e., knobs to tune by hand) and affording easier mathematical analysis coming from
convexity of the training criterion: by the end of the 1990’s, the machine learning com-
munity had largely abandoned artificial neural networks in favor of these more limited
methods despite the impressive performance of neural networks on some tasks (LeCun
et al., 1998a; Bengio et al., 2001a).
In the early years of this century, the groups at Toronto, Montreal, and NYU (and
shortly after, Stanford) worked together under a Canadian long-term research initiative
(the Canadian Institute for Advanced Research, CIFAR) to break through two of
the limitations of old-day neural networks: unsupervised learning and the difficulty of
training deep networks. Their work initiated a new wave of interest in artificial neural
networks by introducing a new way of learning multiple layers of non-linear features
without requiring vast amounts of labeled data. Deep architectures had been proposed
before (Fukushima, 1980; LeCun et al., 1989; Schmidhuber, 1992; Utgoff and Stracuzzi,
2002), but without major success in jointly training a deep neural network with many
layers, except to some extent in the case of convolutional architectures (LeCun et al.,
1989, 1998b), covered in Chapter 11. The breakthrough came from a semi-supervised
procedure: using unsupervised learning to learn one layer of features at a time and then
fine-tuning the whole system with labeled data (Hinton et al., 2006; Bengio et al., 2007;
Ranzato et al., 2007), described in Chapter 10. This initiated a lot of new research
and other ways of successfully training deep nets emerged. Even though unsupervised
pre-training is sometimes unnecessary for datasets with a very large number of labels,
it was the early success of unsupervised pre-training that led many new researchers
to investigate deep neural networks. In particular, Glorot et al. (2011a) showed for the
first time that deep supervised networks could be trained without unsupervised pre-training,
using rectifiers (Nair and Hinton, 2010b) instead of the previously used forms of non-
linearity in neural networks, as well as appropriate initialization (Glorot and Bengio,
2010) allowing information to flow well both forward (to produce predictions from input)
and backward (to propagate error signals). A large fraction of the subsequent successes
of deep networks have relied on piecewise non-linearities such as the rectifier (Krizhevsky
et al., 2012a) and maxout (Goodfellow et al., 2013a).
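A minimal sketch of the rectifier next to the saturating sigmoid it largely replaced; comparing their gradients hints at why error signals flow backward more easily:

```python
# A minimal sketch: for large |x| the sigmoid's gradient vanishes,
# while the rectifier's gradient stays exactly 1 wherever it is active.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    return np.maximum(0.0, x)    # relu(x) = max(0, x)

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(rectifier(x))                   # [0.  0.  0.5 5. ]
print(sigmoid(x) * (1 - sigmoid(x)))  # ~0.0066 at x = 5: nearly vanished
print(np.where(x > 0, 1.0, 0.0))      # rectifier gradient: 0 or exactly 1
```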
1.4 Recent Impact of Deep Learning Research
Since 2010, deep learning has had spectacular practical successes. It has led to much
better acoustic models that have dramatically improved the state of the art in speech
recognition. Deep neural nets are now used in deployed speech recognition systems
including voice search on Android (Dahl et al., 2010; Deng et al., 2010; Seide
et al., 2011; Hinton et al., 2012a). Deep convolutional nets have led to major advances
in state-of-the-art performance for recognizing large numbers of different types
of objects in images (now deployed in Google+ photo search). They have also had
spectacular successes for pedestrian detection and image segmentation (Sermanet et al.,
2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance
in traffic sign classification (Ciresan et al., 2012). An organization called Kaggle runs
machine learning competitions on the web. Deep learning has had numerous successes
in these competitions.[9][10]
The number of research groups doing deep learning has grown from just 3 in 2006
(Toronto, Montreal, NYU, all within NCAP) to 4 in 2007 (+Stanford), and to more
than 26 in 2013.[11] Accordingly, the number of papers published in this area has
[9] http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview
[10] http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html
[11] http://deeplearning.net/deep-learning-research-groups-and-labs
skyrocketed. Before 2006, it had become very difficult to publish any paper having
to do with artificial neural networks at NIPS or ICML (the leading machine learning
conferences). In the last few years, deep learning has been added as a keyword or area
for submitted papers and sessions at NIPS and ICML. At ICML 2013 there was almost
a whole day devoted to deep learning papers. The first deep learning workshop was co-
organized by Yoshua Bengio, Yann LeCun, Ruslan Salakhutdinov and Hugo Larochelle
at NIPS’2007, in an unofficial session sponsored by CIFAR because the NIPS workshop
organizers had rejected the workshop proposal. It turned out to be the most popular
workshop that year (with around 200 participants). That popularity has continued year
after year, with Yoshua Bengio co-organizing most of the deep learning workshops at
NIPS or ICML since then. There are now even multiple workshops on deep learning
subjects (such as specialized workshops on the application of deep learning to speech or
to natural language).
This has led Yann LeCun and Yoshua Bengio to create a new conference on the sub-
ject. They called it the International Conference on Learning Representations
(ICLR) because its scope encompasses not just deep learning but the more general
subject of representation learning (which includes topics such as sparse coding, which
learns shallow representations; shallow representation-learners can be used as
building blocks for deep representation-learners). The first ICLR was ICLR’2013, and
was a clear success, attracting more than 110 participants (almost twice as many as
the defunct conference which ICLR replaced, the Learning Conference). It was also an
opportunity to experiment with a novel reviewing and publishing model based on open
reviews and openly visible submissions (as arXiv papers), with the objective of achieving
a faster dissemination of information (not keeping papers hidden from view while being
evaluated) and a more open and interactive discussion between authors, reviewers and
non-anonymous spontaneous commenters.
The general media and media specialized in information technology have picked up
on deep learning as an exciting new technology since 2012, and we have even seen it
covered on television in 2013, on NBC.[12] It started with two articles in the New York
Times in 2012, both covering the work done at Stanford by Andrew Ng and his group.
In March 2013 it was announced (in particular in Wired[13]) that Google acqui-hired
Geoff Hinton, Ilya Sutskever, and Alex Krizhevsky to help them “supercharge” machine
learning applications at Google. In April 2013, the MIT Technology Review published
their annual list of 10 Breakthrough Technologies and they put deep learning first on
their list. This stirred a lot of discussion around the web, including an
interview of Yoshua Bengio for Wired,[14] and there have since been many pieces, not
just in technology oriented media but also, for example, in business oriented media like
Forbes.[15] As writing of this book started, after the major investments from Google,
Microsoft and Baidu, the news was that Facebook was creating a new research group
12
http://video.cnbc.com/gallery/?play=1&video=3000192292
13
http://www.wired.com/wiredenterprise/2013/03/google_hinton/
14
http://www.wired.com/wiredenterprise/2013/06/yoshua-bengio
15
http://www.forbes.com/sites/netapp/2013/08/19/what-is-deep-learning/
16
devoted to deep learning
16
, led by Yann LeCun.
1.5 Challenges for Future Research
In spite of all the successes of deep learning to date and the obvious fact that
deep learning already has a great industrial impact, there is still a huge gap
between the information processing and perception capabilities of even simple
animals and those of our current technology. Understanding more of the principles
behind such information processing architectures will require much more basic
science. Such fundamental research has been essential to bringing deep learning
to where it is today, and some of the challenges ahead require even more of it,
along with much deeper analysis and theory regarding even the existing
algorithms. Scaling up deep learning substantially will also require significant
systems engineering research. All of these future advances will probably lead to
big changes in how we build and deploy information technology. While the novel
techniques developed by deep learning researchers have achieved impressive
progress, training deep networks to learn meaningful representations is not yet a
solved problem. Moreover, while we believe that deep learning is a crucial
component, it is not a complete solution to AI. In this book, we review the
current state of knowledge on deep learning, and we also present our ideas for
how to move beyond current deep learning methods toward human-level AI. We
identify some major qualitative deficiencies in existing deep learning systems
and propose general ideas for how to address them. Success on any of these fronts
would represent a major conceptual leap forward on the path to AI.
In the examples of outstanding applications of deep learning described above, the
impressive breakthroughs have mostly been achieved with supervised learning
techniques for deep architectures, where the training examples are (input, target
output) pairs, with the target output often being a human-produced label. We
believe that some of the most important future progress in deep learning will
hinge on achieving a similar impact in the unsupervised case (no labels at all,
only unlabeled examples) and the semi-supervised case (a mix of labeled and
unlabeled examples).
However, even in the supervised case, there are signs that the techniques that
work in the regime of neural networks with a few thousand neurons per layer
encounter training difficulties when applied to networks with 10 times more units
per layer and 100 times more connections, making it difficult to exploit the
increased model size (Dauphin and Bengio, 2013a). Even though the scaling
behavior of stochastic gradient descent is theoretically very good in terms of
computations per update, these observations suggest a numerical optimization
challenge that must be addressed.
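To make the “computations per update” point concrete, the following is a minimal
sketch (our illustration, not code from the work cited above) of one minibatch
stochastic gradient descent update on a linear model; all sizes and the learning
rate are illustrative assumptions. Note that the cost of one update grows with
the model and minibatch size, but is independent of the total number of training
examples.

    # A minimal sketch of one minibatch SGD update on a linear model with
    # squared error. All sizes below are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    n_in, n_out = 1000, 10            # model dimensions
    batch_size = 128
    learning_rate = 0.1

    W = rng.normal(scale=0.01, size=(n_in, n_out))   # parameters
    b = np.zeros(n_out)

    def sgd_step(X, Y, W, b):
        """One update; cost is O(batch_size * n_in * n_out), regardless of
        how many training examples exist in total."""
        pred = X @ W + b              # forward pass on the minibatch only
        err = pred - Y
        W -= learning_rate * (X.T @ err) / len(X)
        b -= learning_rate * err.mean(axis=0)
        return W, b

    # A minibatch drawn from a (possibly enormous) dataset: the per-update
    # cost is the same whether the dataset has 10**4 or 10**9 examples, but
    # scaling n_in and n_out by 10 each multiplies it by roughly 100.
    X_batch = rng.normal(size=(batch_size, n_in))
    Y_batch = rng.normal(size=(batch_size, n_out))
    W, b = sgd_step(X_batch, Y_batch, W, b)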
In addition to these numerical optimization difficulties, scaling up large and
deep neural networks as they currently stand would require a substantial increase
in computing power, which remains a limiting factor of our research. Training
much larger models with current hardware (or the hardware likely to be available
in the next few years) will require a change in design and/or the ability to
effectively exploit parallel computation. These requirements raise non-obvious
questions where fundamental research is also needed.
Furthermore, some of the biggest challenges remain ahead of us regarding
unsupervised deep learning. Powerful unsupervised learning is important for many
reasons:

• Unsupervised learning allows a learner to take advantage of unlabeled data.
Most of the data available to machines (and to humans and animals) is unlabeled,
i.e., without a precise and symbolic characterization of its semantics and of
the outputs desired from a learner. Humans and animals are also driven by
motivation rather than explicit labels, which guides research into learning
algorithms based on a reinforcement signal, one that is much weaker than the
signal required for supervised learning.

• Whereas supervised learning always answers the same type of question (predict
y from x), unsupervised learning allows a learner to capture the most general
kind of information about the observed variables, so as to be able to answer new
questions about them in the future, questions that were not anticipated at the
time of seeing the training examples.

• Unsupervised learning has been shown to be a good regularizer for supervised
learning (Erhan et al., 2010), meaning that it can help the learner generalize
better, especially when the number of labeled examples is small. This advantage
clearly shows up in practical applications (e.g., the transfer learning
competitions won by NCAP members with unsupervised deep learning (Bengio, 2011;
Mesnil et al., 2011; Goodfellow et al., 2011)) where the distribution changes or
new classes or domains are considered (transfer learning, domain adaptation),
when some classes are frequent while many others are rare (fat tail or Zipf
distribution), or when new classes are shown with zero, one, or very few
examples (zero-shot and one-shot learning (Larochelle et al., 2008; Lake et al.,
2013; Socher and Ng, 2013)).

• There is evidence suggesting that unsupervised learning can be successfully
achieved mostly from a local training signal (as indicated by the successes of
the unsupervised layer-wise pre-training procedures (Bengio, 2009),
semi-supervised embedding (Weston et al., 2008), and intermediate-level hints
(Gulcehre and Bengio, 2013)), i.e., that it may suffer less from the difficulty
of propagating credit across a large network, which has been observed for
supervised learning. Local training algorithms are also much more likely to be
amenable to fast hardware-based implementations (see, for example, DARPA's
UPSIDE program as evidence of recent interest in hardware-enabled
implementations of deep learning). A minimal sketch of such a layer-wise
procedure follows this list.

• Solving the core problems of unsupervised learning would also help us solve
the core problems of structured-output tasks, where the output variable is very
high-dimensional, instead of just a few numbers or classes. For example, the
output could be a sentence or an image. In that case, the mathematical and
computational issues involved in unsupervised learning also arise, because there
is an exponentially large number of configurations of the output values that
need to be considered (to sum over them, when computing probability gradients,
or to find an optimal configuration, when making a decision).
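As promised above, here is a minimal sketch (our own illustration, not code from
the cited papers) of greedy layer-wise unsupervised pre-training with simple
tied-weight autoencoders: each layer is trained purely from a local
reconstruction signal, and its encoder output becomes the input of the next
layer. All sizes, learning rates, and step counts are illustrative assumptions.

    # Greedy layer-wise pre-training with tied-weight autoencoders (a sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    def train_autoencoder_layer(X, n_hidden, lr=0.01, n_steps=200):
        """Train a tied-weight autoencoder on X using only its own (local)
        reconstruction loss; return the encoder parameters and encoded data."""
        n_in = X.shape[1]
        W = rng.normal(scale=0.01, size=(n_in, n_hidden))
        b = np.zeros(n_hidden)          # encoder bias
        c = np.zeros(n_in)              # decoder bias
        for _ in range(n_steps):
            H = np.tanh(X @ W + b)      # encode
            R = H @ W.T + c             # decode with tied weights
            err = R - X                 # local reconstruction error
            dA = (err @ W) * (1.0 - H ** 2)          # backprop through tanh
            W -= lr * (X.T @ dA + err.T @ H) / len(X)  # both uses of W
            b -= lr * dA.mean(axis=0)
            c -= lr * err.mean(axis=0)
        return (W, b), np.tanh(X @ W + b)

    # Stack layers greedily: each layer sees only its own reconstruction loss,
    # never a global error signal propagated from the top of the network.
    X = rng.normal(size=(256, 64))      # stand-in for unlabeled training data
    pretrained = []
    H = X
    for n_hidden in (32, 16):
        layer, H = train_autoencoder_layer(H, n_hidden)
        pretrained.append(layer)
    # `pretrained` can then initialize a deep network fine-tuned with labels.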
To summarize, some of the challenges we view as important for future
breakthroughs in deep learning are the following:

• How should we deal with the fundamental challenges behind unsupervised
learning, such as intractable inference and sampling? See Chapters 15, 16, and
17, as well as the toy example following this list.

• How can we build and train much larger, more adaptive, and reconfigurable deep
architectures, thus maximizing the advantage one can draw from larger datasets?
See Chapter 8.

• How can we improve the ability of deep learning algorithms to disentangle the
underlying factors of variation, or, put more simply, to make sense of the world
around us? See Chapter 14 on this very basic question about what is involved in
learning a good representation.
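To give a feel for why inference can be intractable, here is a toy example of
our own (not taken from the book): computing the exact normalization constant of
a small energy-based model over binary variables requires summing over every
configuration, and the number of configurations grows exponentially with the
number of variables. The model and its weights are illustrative assumptions.

    # Toy illustration of intractable inference: the exact partition function
    # of an energy-based model over n binary units sums over 2**n states.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 16                              # already 65,536 configurations
    W = np.triu(rng.normal(scale=0.1, size=(n, n)), k=1)  # pairwise weights

    def energy(x):
        return -(x @ W @ x)             # lower energy = more probable

    Z = sum(np.exp(-energy(np.array(cfg, dtype=float)))
            for cfg in itertools.product((0, 1), repeat=n))
    print(Z)
    # The number of terms doubles with every added variable; at n = 60 this
    # brute-force sum is hopeless, which motivates the approximate inference
    # and sampling methods discussed in Chapters 15, 16, and 17.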
Many other challenges are not discussed in this book, such as the needed integration of
deep learning with reinforcement learning, active learning, and reasoning.