Chapter 1
Introduction
Inventors have long dreamed of creating machines that think. Ancient Greek
myths tell of intelligent objects, such as animated statues of human beings and
tables that arrive full of food and drink when called.
When programmable computers were first conceived, people wondered whether
they might become intelligent, over a hundred years before one was built (Lovelace,
1842). Today, artificial intelligence (AI) is a thriving field with many practical
applications and active research topics. We look to intelligent software to automate
routine labor, understand speech or images, make diagnoses in medicine,
and support basic scientific research.
In the early days of artificial intelligence, the field rapidly tackled and solved
problems that are intellectually difficult for human beings but relatively straight-
forward for computers—problems that can be described by a list of formal, math-
ematical rules. The true challenge to artificial intelligence proved to be solving
the tasks that are easy for people to perform but hard for people to describe
formally—problems that we solve intuitively, that feel automatic, like recognizing
spoken words or faces in images.
This book is about a solution to these more intuitive problems. This solution
is to allow computers to learn from experience and understand the world in terms
of a hierarchy of concepts, with each concept defined in terms of its relation
to simpler concepts. By gathering knowledge from experience, this approach
avoids the need for human operators to formally specify all of the knowledge that
the computer needs. The hierarchy of concepts allows the computer to learn
complicated concepts by building them out of simpler ones. If we draw a graph
showing how these concepts are built on top of each other, the graph is deep, with
many layers. For this reason, we call this approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal
environments and did not require computers to have much knowledge about the
world. For example, IBM’s Deep Blue chess-playing system defeated world cham-
pion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world,
containing only sixty-four locations and thirty-two pieces that can move in only
rigidly circumscribed ways. Devising a successful chess strategy is a tremendous
accomplishment, but the challenge is not due to the difficulty of describing the
relevant concepts to the computer. Chess can be completely described by a very
brief list of completely formal rules, easily provided ahead of time by the pro-
grammer.
Ironically, abstract and formal tasks that are among the most difficult mental
undertakings for a human being are among the easiest for a computer. Computers
have long been able to defeat even the best human chess player, but have
only recently begun to match the ability of average human beings to recognize
objects or speech. A person’s everyday life requires an immense amount of
knowledge about the world, and much of this knowledge is subjective and intu-
itive, and therefore difficult to articulate in a formal way. Computers need to
capture this same knowledge in order to behave in an intelligent way. One of the
key challenges in artificial intelligence is how to get this informal knowledge into
a computer.
Several artificial intelligence projects have sought to hard-code knowledge
about the world in formal languages. A computer can reason about statements in
these formal languages automatically using logical inference rules. This is known
as the knowledge base approach to artificial intelligence. None of these projects
has led to a major success. One of the most famous such projects is Cyc (Lenat
and Guha, 1989). Cyc is an inference engine and a database of statements in
a language called CycL. These statements are entered by a staff of human su-
pervisors. It is an unwieldy process. People struggle to devise formal rules with
enough complexity to accurately describe the world. For example, Cyc failed to
understand a story about a person named Fred shaving in the morning (Linde,
1992). Its inference engine detected an inconsistency in the story: it knew that
people do not have electrical parts, but because Fred was holding an electric razor,
it believed the entity “FredWhileShaving” contained electrical parts. It therefore
asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that
AI systems need the ability to acquire their own knowledge, by extracting patterns
from raw data. This capability is known as machine learning. The introduction
of machine learning allowed computers to tackle problems involving knowledge
of the real world and make decisions that appear subjective. A simple machine
learning algorithm called logistic regression can determine whether to recommend
cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm
called naive Bayes can separate legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily
on the representation of the data they are given. For example, when logistic
regression is used to recommend cesarean delivery, the AI system does not examine
the patient directly. Instead, the doctor tells the system several pieces of relevant
information, such as the presence or absence of a uterine scar. Each piece of
information included in the representation of the patient is known as a feature.
Logistic regression learns how each of these features of the patient correlates with
various outcomes. However, it has no influence over how the features are
defined. If logistic regression were given a 3-D MRI image of the
patient, rather than the doctor’s formalized report, it would not be able to make
useful predictions. Individual voxels¹ in an MRI scan have negligible correlation
with any complications that might occur during delivery.
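To make this concrete, here is a minimal sketch of such a feature-based pipeline. The feature names and data are hypothetical illustrations, not the system of Mor-Yosef et al. (1990); the point is only that the learner sees the doctor’s features, never the patient.

```python
# A minimal sketch of logistic regression over hand-designed features.
# Feature names and data are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one patient as reported by the doctor:
# [uterine_scar, breech_presentation, maternal_age_over_40]
X = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
])
y = np.array([1, 1, 0, 1])  # 1 = cesarean delivery recommended

model = LogisticRegression().fit(X, y)
# The learned weights show how each feature correlates with the outcome;
# the model has no say in how those features were defined.
print(model.coef_, model.predict([[0, 1, 0]]))
```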
This dependence on representations is a general phenomenon that appears
throughout computer science and even daily life. In computer science, operations
such as searching a collection of data can proceed exponentially faster if the collec-
tion is structured and indexed intelligently. People can easily perform arithmetic
on Arabic numerals, but find arithmetic on Roman numerals much more time
consuming. It is not surprising that the choice of representation has an enormous
effect on the performance of machine learning algorithms. For a simple visual
example, see Fig. 1.1.
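The search example can be made concrete in a few lines of Python: the same membership query scans every element of an unstructured list, but needs only a logarithmic number of comparisons once the data is kept in sorted order. This is a sketch for illustration only.

```python
# The same query under two representations of the same data.
import bisect

data = [42, 7, 19, 88, 3, 56]

def linear_contains(xs, target):
    return any(x == target for x in xs)      # unstructured: scan everything

sorted_data = sorted(data)                   # a better representation

def bisect_contains(xs, target):
    i = bisect.bisect_left(xs, target)       # binary search: O(log n)
    return i < len(xs) and xs[i] == target

print(linear_contains(data, 19), bisect_contains(sorted_data, 19))
```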
Many artificial intelligence tasks can be solved by designing the right set of
features to extract for that task, then providing these features to a simple machine
learning algorithm. For example, a useful feature for speaker identification from
sound is the pitch. The pitch can be formally specified—it is the lowest frequency
major peak of the spectrogram. It is useful for speaker identification because it
is determined by the size of the vocal tract, and therefore gives a strong clue as
to whether the speaker is a man, woman, or child.
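As a rough illustration (not a production pitch tracker), the sketch below implements that definition directly: take the magnitude spectrum of a signal and report the lowest-frequency peak that clears an arbitrary “major peak” threshold. The synthetic 120 Hz signal and the 0.5 threshold are assumptions made for the example.

```python
# Sketch: pitch as the lowest-frequency major peak of the spectrum.
import numpy as np

fs = 16000                                  # sample rate in Hz
t = np.arange(fs) / fs                      # one second of samples
# Synthetic "voice": a 120 Hz fundamental plus one harmonic
signal = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
major = spectrum > 0.5 * spectrum.max()     # arbitrary "major peak" cutoff
pitch = freqs[major].min()                  # lowest-frequency major peak
print(pitch)                                # ~120.0 Hz
```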
However, for many tasks, it is difficult to know what features should be ex-
tracted. For example, suppose that we would like to write a program to detect
cars in photographs. We know that cars have wheels, so we might like to use the
presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly
what a wheel looks like in terms of pixel values. A wheel has a simple geometric
shape but its image may be complicated by shadows falling on the wheel, the sun
glaring off the metal parts of the wheel, the fender of the car or an object in the
foreground obscuring part of the wheel, and so on.
One solution to this problem is to use machine learning to discover not only
the mapping from representation to output but also the representation itself.
This approach is known as representation learning. Learned representations often
¹A voxel is the value at a single point in a 3-D scan, much as a pixel is the value at a single
point in an image.
Figure 1.1: Example of different representations: suppose we want to separate two cate-
gories of data by drawing a line between them in a scatterplot. In the plot on the left, we
represent some data using Cartesian coordinates, and the task is impossible. In the plot
on the right, we represent the data with polar coordinates and the task becomes simple
to solve with a vertical line. (Figure credit: David Warde-Farley)
result in much better performance than can be obtained with hand-designed
representations. They also allow AI systems to rapidly adapt to new tasks, with
minimal human intervention. A representation learning algorithm can discover a
good set of features for a simple task in minutes, or a complex task in hours to
months. Manually designing features for a complex task requires a great deal of
human time and effort; it can take decades for an entire community of researchers.
The quintessential example of a representation learning algorithm is the au-
toencoder. An autoencoder is the combination of an encoder function that converts
the input data into a different representation, and a decoder function that converts
the new representation back into the original format. Autoencoders are trained
to preserve as much information as possible when an input is run through the
encoder and then the decoder, but are also trained to make the new representa-
tion have various nice properties. Different kinds of autoencoders aim to achieve
different kinds of properties.
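The sketch below shows the idea in its simplest assumed form: a linear encoder and a linear decoder trained by gradient descent to minimize reconstruction error. Practical autoencoders add nonlinearities and extra penalties that give the new representation its desired properties.

```python
# A minimal linear autoencoder trained to minimize reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # 200 examples with 10 features each
k = 3                                # size of the learned representation

W_enc = rng.normal(scale=0.1, size=(k, 10))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(10, k))   # decoder weights

lr = 0.01
for step in range(500):
    H = X @ W_enc.T                  # encode: the new representation
    X_hat = H @ W_dec.T              # decode: back to the original format
    dX_hat = 2 * (X_hat - X) / len(X)         # gradient of mean squared error
    grad_dec = dX_hat.T @ H
    grad_enc = (dX_hat @ W_dec).T @ X
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean((X - (X @ W_enc.T) @ W_dec.T) ** 2))   # reconstruction error
```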
When designing features or algorithms for learning features, our goal is usually
to separate the factors of variation that explain the observed data. In this context,
we use the word “factors” simply to refer to separate sources of influence; the
factors are usually not combined by multiplication. Such factors are often not
quantities that are directly observed but they may exist either as unobserved
objects or forces in the physical world that affect observable quantities, or they
are constructs in the human mind that provide useful simplifying explanations
or inferred causes of the observed data. They can be thought of as concepts or
abstractions that help us make sense of the rich variability in the data. When
analyzing a speech recording, the factors of variation include the speaker’s age
and sex, their accent, and the words that they are speaking. When analyzing an
image of a car, the factors of variation include the position of the car, its color,
and the angle and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applica-
tions is that many of the factors of variation influence every single piece of data
we are able to observe. The individual pixels in an image of a red car might be
very close to black at night. The shape of the car’s silhouette depends on the
viewing angle. Most applications require us to disentangle the factors of variation
and discard the ones that we do not care about.
Of course, it can be very difficult to extract such high-level, abstract features
from raw data. Many of these factors of variation, such as a speaker’s accent,
can only be identified using sophisticated, nearly human-level understanding of
the data. When it is nearly as difficult to obtain a representation as to solve the
original problem, representation learning does not, at first glance, seem to help
us.
Deep learning solves this central problem in representation learning by intro-
ducing representations that are expressed in terms of other, simpler represen-
tations. Deep learning allows the computer to build complex concepts out of
simpler concepts. Fig. 1.2 shows how a deep learning system can represent the
concept of an image of a person by combining simpler concepts, such as corners
and contours, which are in turn defined in terms of edges.
The quintessential example of a deep learning model is the multilayer percep-
tron (MLP). A multilayer perceptron is just a mathematical function mapping
some set of input values to output values. The function is formed by composing
many simpler functions. We can think of each application of a different mathe-
matical function as providing a new representation of the input.
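The composition can be sketched directly in code. The weights here are random and untrained, chosen only to show the structure: each layer is a simple function, and the model is nothing more than their composition.

```python
# An untrained MLP as a composition of simple functions.
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b):
    # one simple function: an affine map followed by a nonlinearity
    return lambda h: np.maximum(0.0, W @ h + b)

f1 = layer(rng.normal(size=(5, 4)), np.zeros(5))   # first representation
f2 = layer(rng.normal(size=(3, 5)), np.zeros(3))   # second representation
W3 = rng.normal(size=(2, 3))
f3 = lambda h: W3 @ h                              # output mapping

x = rng.normal(size=4)
y = f3(f2(f1(x)))    # the whole model: f(x) = f3(f2(f1(x)))
print(y)
```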
The idea of learning the right representation for the data provides one per-
spective on deep learning. Another perspective on deep learning is that it allows
the computer to learn a multi-step computer program. Each layer of the repre-
sentation can be thought of as the state of the computer’s memory after executing
another set of instructions in parallel. Networks with greater depth can execute
more instructions in sequence. Being able to execute instructions sequentially of-
fers great power because later instructions can refer back to the results of earlier
instructions. According to this view of deep learning, not all of the information
in a layer’s representation of the input necessarily encodes factors of variation
that explain the input. The representation is also used to store state information
that helps to execute a program that can make sense of the input. This state
[Figure 1.2 diagram: Visible layer (input pixels) → 1st hidden layer (edges) → 2nd hidden layer (corners and contours) → 3rd hidden layer (object parts) → Output (object identity: CAR, PERSON, ANIMAL). The panel images are reproduced from Zeiler and Fergus (2014), “Visualizing and Understanding Convolutional Networks.”]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to un-
derstand the meaning of raw sensory input data, such as this image represented as a
collection of pixel values. The function mapping from a set of pixels to an object identity
is very complicated. Learning or evaluating this mapping seems insurmountable if tack-
led directly. Deep learning resolves this difficulty by breaking the desired complicated
mapping into a series of nested simple mappings, each described by a different layer of
the model. The input is presented at the visible layer, so named because it contains the
variables that we are able to observe. Then a series of hidden layers extracts increasingly
abstract features from the image. These layers are called “hidden” because their values
are not given in the data; instead the model must determine which concepts are useful
for explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer
can easily identify edges, by comparing the brightness of neighboring pixels. Given the
first hidden layer’s description of the edges, the second hidden layer can easily search for
corners and extended contours, which are recognizable as collections of edges. Given the
second hidden layer’s description of the image in terms of corners and contours, the third
hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts
it contains can be used to recognize the objects present in the image. Images reproduced
with permission from Zeiler and Fergus (2014).
[Figure 1.3 diagram: two computational graphs for the same model over inputs x1, x2 and weights w1, w2; one is built from low-level elements (multiplication and addition), the other from a single “logistic regression” element.]
Figure 1.3: Illustration of computational flow graphs mapping an input to an output
where each node performs an operation. Depth is the length of the longest path from input
to output but depends on the definition of what constitutes a possible computational step.
The computation depicted in these graphs is the output of a logistic regression model,
σ(w⊤x), where σ is the logistic sigmoid function. If we use addition, multiplication, and
logistic sigmoids as the elements of our computer language, then this model has depth
three. If we view logistic regression as an element itself, then this model has depth one.
information could be analogous to a counter or pointer in a traditional computer
program. It has nothing to do with the content of the input specifically, but it
helps the model to organize its processing.
There are two main ways of measuring the depth of a model.
The first view is based on the number of sequential instructions that must be
executed to evaluate the architecture. We can think of this as the length of the longest
path through a flow chart that describes how to compute each of the model’s
outputs given its inputs. Just as two equivalent computer programs will have
different lengths depending on which language the program is written in, the
same function may be drawn as a flow chart with different depths depending on
which functions we allow to be used as individual steps in the flow chart. Fig. 1.3
illustrates how this choice of language can give two different measurements for
the same architecture.
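The ambiguity can be seen by tracing the computation of Fig. 1.3 step by step. The sketch below counts steps under two assumed “languages” for the same logistic regression model.

```python
# Depth depends on which operations count as primitive steps.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x = np.array([0.5, -1.0]), np.array([2.0, 1.0])

# Language 1: {*, +, sigmoid} are the elements -> depth three
products = w * x              # step 1: elementwise multiplication
z = products.sum()            # step 2: addition
y = sigmoid(z)                # step 3: logistic sigmoid

# Language 2: logistic regression is itself an element -> depth one
def logistic_regression(w, x):
    return sigmoid(w @ x)     # a single step in this language

assert y == logistic_regression(w, x)
```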
Another approach, used by deep probabilistic models, regards as the depth of a model
not the depth of the computational graph but the depth of the graph describing how concepts are
related to each other. In this case, the depth of the flow-chart of the computations
needed to compute the representation of each concept may be much deeper than
the graph of the concepts themselves. This is because the system’s understanding
of the simpler concepts can be refined given information about the more complex
concepts. For example, an AI system observing an image of a face with one eye in
shadow may initially only see one eye. After detecting that a face is present, it can
then infer that a second eye is probably present as well. In this case, the graph of
concepts only includes two layers—a layer for eyes and a layer for faces—but the
graph of computations includes 2n layers if we refine our estimate of each concept
given the other n times.
Because it is not always clear which of these two views—the depth of the
computational graph, or the depth of the probabilistic modeling graph—is most
relevant, and because different people choose different sets of smallest elements
from which to construct their graphs, there is no single correct value for the depth
of an architecture, just as there is no single correct value for the length of a computer
program. Nor is there a consensus about how much depth a model requires to
qualify as “deep.” However, deep learning can safely be regarded as the study of
models that involve a greater amount of composition of either learned functions
or learned concepts than traditional machine learning does.
To summarize, deep learning, the subject of this book, is an approach to AI.
Specifically, it is a type of machine learning, a technique that allows computer
systems to improve with experience and data. According to the authors of this
book, machine learning is the only viable approach to building AI systems that can
operate in complicated, real-world environments. Deep learning is a particular
kind of machine learning that achieves great power and flexibility by learning
to represent the world as a nested hierarchy of concepts and representations,
with each concept defined in relation to simpler concepts, and more abstract
representations computed in terms of less abstract ones. Fig. 1.4 illustrates the
relationship between these different AI disciplines. Fig. 1.5 gives a high-level
schematic of how each works.
1.1 Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main
target audiences in mind. One of these target audiences is university students (un-
dergraduate or graduate) learning about machine learning, including those who
are beginning a career in deep learning and artificial intelligence research. The
other target audience is software engineers who do not have a machine learning or
statistics background, but want to rapidly acquire one and begin using deep learn-
ing in their product or platform. Software engineers working in a wide variety of
industries are likely to find deep learning to be useful, as it has already proven
successful in many areas including computer vision, speech and audio processing,
natural language processing, robotics, bioinformatics and chemistry, video games,
search engines, online advertising, and finance.
This book has been organized into three parts in order to best accommodate
a variety of readers. Part 1 introduces basic mathematical tools and machine
[Figure 1.4 diagram: nested sets AI ⊃ machine learning ⊃ representation learning ⊃ deep learning, with one example in each region: knowledge bases (AI), logistic regression (machine learning), shallow autoencoders (representation learning), MLPs (deep learning).]
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learn-
ing, which is in turn a kind of machine learning, which is used for many but not all
approaches to AI. Each section of the Venn diagram includes an example of an AI tech-
nology.
Figure 1.5: Flow-charts showing how the different parts of an AI system relate to each
other within different AI disciplines. Shaded boxes indicate components that are able to
learn from data.
learning concepts. Part 2 describes the most established deep learning algorithms
that are essentially solved technologies. Part 3 describes more speculative ideas
that are widely believed to be important for future research in deep learning.
Readers should feel free to skip parts that are not relevant given their interests
or background. Readers familiar with linear algebra, probability, and fundamental
machine learning concepts can skip Part 1, for example, while readers who just
want to implement a working system need not read beyond Part 2.
We do assume that all readers come from a computer science background. We
assume familiarity with programming, a basic understanding of computational
performance issues, complexity theory, introductory level calculus, and some of
the terminology of graph theory.
1.2 Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather
than providing a detailed history of deep learning, we identify a few key trends:
• Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.

• Deep learning has become more useful as the amount of available training
data has increased.

• Deep learning models have grown in size over time as computer hardware
and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing
accuracy over time.
1.2.1 The Many Names and Changing Fortunes of Neural Networks
We expect that many readers of this book have heard of deep learning as an
exciting new technology, and are surprised to see a mention of “history” in a
book about an emerging field. In fact, deep learning has a long and rich history.
Deep learning only appears to be new, because it was relatively unpopular for
several years preceding its current popularity, and because it has gone by
many different names. While the term “deep learning” is relatively new, the field
dates back to the 1950s. The field has been rebranded many times, reflecting the
influence of different researchers and different perspectives.
A comprehensive history of deep learning is beyond the scope of this peda-
gogical textbook. However, some basic context is useful for understanding deep
learning. Broadly speaking, there have been three waves of development of deep
learning: deep learning known as cybernetics in the 1940s-1960s, deep learning
known as connectionism in the 1980s-1990s, and the current resurgence under the
name deep learning beginning in 2006. See Figure 1.6 for a basic timeline.
Figure 1.6: The three historical waves of artificial neural nets research, starting with
cybernetics in the 1940s–1960s, with the perceptron (Rosenblatt, 1958) to train a
single neuron, then the connectionist approach of the 1980-1995 period, with back-
propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden
layers, and the current wave, deep learning, started around 2006 (Hinton et al., 2006;
Bengio et al., 2007; Ranzato et al., 2007), which allows us to train very deep networks.
Some of the earliest learning algorithms we recognize today were intended
to be computational models of biological learning, i.e. models of how learning
happens or could happen in the brain. As a result, one of the names that deep
learning has gone by is artificial neural networks (ANNs). The corresponding
perspective on deep learning models is that they are engineered systems inspired
by the biological brain (whether the human brain or the brain of another ani-
mal). The neural perspective on deep learning is motivated by two main ideas.
One idea is that the brain provides a proof by example that intelligent behavior
is possible, and a conceptually straightforward path to building intelligence is to
reverse engineer the computational principles behind the brain and duplicate its
functionality. Another perspective is that it would be deeply interesting to under-
stand the brain and the principles that underlie human intelligence, so machine
learning models that shed light on these basic scientific questions are useful apart
from their ability to solve engineering applications.
The modern term “deep learning” goes beyond the neuroscientific perspective
on the current breed of machine learning models. It appeals to a more general
principle of learning multiple levels of composition, which can be applied in ma-
chine learning frameworks that are not necessarily neurally inspired.
The earliest predecessors of modern deep learning were simple linear models
motivated from a neuroscientific perspective. These models were designed to
take a set of n input values x_1, . . . , x_n and associate them with an output y.
These models would learn a set of weights w_1, . . . , w_n and compute their output
f(x, w) = x_1 w_1 + · · · + x_n w_n. This first wave of neural networks research was
known as cybernetics (see Fig. 1.6).
The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model
of brain function. This linear model could recognize two different categories of
inputs by testing whether f(x, w) is positive or negative. Of course, for the model
to correspond to the desired definition of the categories, the weights needed to be
set correctly. These weights could be set by the human operator. In the 1950s,
the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn
the weights defining the categories given examples of inputs from each category.
The Adaptive Linear Element (ADALINE), which dates from about the same
time, simply returned the value of f(x) itself to predict a real number (Widrow
and Hoff, 1960), and could also learn to predict these numbers from data.
These simple learning algorithms greatly affected the modern landscape of ma-
chine learning. The training algorithm used to adapt the weights of the ADALINE
was a special case of an algorithm called stochastic gradient descent. Slightly mod-
ified versions of the stochastic gradient descent algorithm remain the dominant
training algorithms for deep learning models today.
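The sketch below shows ADALINE-style learning with stochastic gradient descent on synthetic data (the data and learning rate are assumptions for illustration): the model outputs f(x, w) = x_1 w_1 + · · · + x_n w_n directly, and each example nudges the weights down the gradient of its squared error.

```python
# Sketch: ADALINE-style linear model trained by stochastic gradient descent.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(100, 2))                   # synthetic inputs
y = X @ true_w + 0.01 * rng.normal(size=100)    # real-valued targets

w = np.zeros(2)
lr = 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):           # one example per update
        error = X[i] @ w - y[i]                 # f(x, w) minus the target
        w -= lr * error * X[i]                  # stochastic gradient step

print(w)    # close to [2, -3]
```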
Models based on the f(x, w) used by the perceptron and ADALINE are called
linear models. These models remain some of the most widely used machine learn-
ing models, though in many cases they are trained in different ways than the
original models were trained.
Linear models have many limitations. Most famously, they cannot learn the
XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0
and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused
a backlash against biologically inspired learning in general (Minsky and Papert,
1969). This is the first dip in the popularity of neural networks in our broad
timeline (Fig. 1.6).
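It is a short exercise to verify this failure numerically: in the least-squares sense, the best linear model with a bias term predicts 0.5 for all four XOR inputs, so the weights carry no information at all.

```python
# The best linear fit to XOR predicts 0.5 everywhere.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])             # XOR targets

Xb = np.hstack([X, np.ones((4, 1))])           # append a bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)     # least-squares solution

print(w)        # [0, 0, 0.5]: both weights vanish, only the bias remains
print(Xb @ w)   # [0.5, 0.5, 0.5, 0.5] for every input
```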
Today, neuroscience is regarded as an important source of inspiration for deep
learning researchers, but it is no longer the predominant guide for the field.
The main reason for the diminished role of neuroscience in deep learning
research today is that we simply do not have enough information about the brain
to use it as a guide. To obtain a deep understanding of the actual algorithms
used by the brain, we would need to be able to monitor the activity of (at the
very least) thousands of interconnected neurons simultaneously. Because we are
not able to do this, we are far from understanding even some of the most simple
and well-studied parts of the brain (Olshausen and Field, 2005).
Neuroscience has given us a reason to hope that a single deep learning algo-
rithm can solve many different tasks. Neuroscientists have found that ferrets can
learn to “see” with the auditory processing region of their brain if their brains
are rewired to send visual signals to that area (Von Melchner et al., 2000). This
suggests that much of the mammalian brain might use a single algorithm to solve
most of the different tasks that the brain solves. Before this hypothesis, machine
learning research was more fragmented, with different communities of researchers
studying natural language processing, vision, motion planning, and speech recog-
nition. Today, these application communities are still separate, but it is common
for deep learning research groups to study many or even all of these application
areas simultaneously.
We are able to draw some rough guidelines from neuroscience. The basic idea
of having many computational units that become intelligent only via their inter-
actions with each other is inspired by the brain. The Neocognitron (Fukushima,
1980) introduced a powerful model architecture for processing images that was
inspired by the structure of the mammalian visual system and later became the
basis for the modern convolutional network (LeCun et al., 1998a), as we will see
in Section 9.8. Most neural networks today are based on a model neuron called
the rectified linear unit. These units were developed from a variety of viewpoints,
with Nair and Hinton (2010b) and Glorot et al. (2011a) citing neuroscience as an
influence, and Jarrett et al. (2009a) citing more engineering-oriented influences.
While neuroscience is an important source of inspiration, it need not be taken
as a rigid guide. We know that actual neurons compute very different functions
than modern rectified linear units, but greater neural realism has not yet found
a machine learning value or interpretation. Also, while neuroscience has success-
fully inspired several neural network architectures, we do not yet know enough
about biological learning for neuroscience to offer much guidance for the learning
algorithms we use to train these architectures.
Media accounts often emphasize the similarity of deep learning to the brain.
While it is true that deep learning researchers are more likely to cite the brain
as an influence than researchers working in other machine learning fields such
as kernel machines or Bayesian statistics, one should not view deep learning as
an attempt to simulate the brain. Modern deep learning draws inspiration from
many fields, especially applied math fundamentals like linear algebra, probabil-
ity, information theory, and numerical optimization. While some deep learning
researchers cite neuroscience as an important influence, others are not concerned
with neuroscience at all.
It is worth noting that the effort to understand how the brain works on an
algorithmic level is alive and well. This endeavor is primarily known as “compu-
tational neuroscience” and is a separate field of study from deep learning. It is
common for researchers to move back and forth between both fields. The field
of deep learning is primarily concerned with how to build computer systems that
are able to successfully solve tasks requiring intelligence, while the field of compu-
tational neuroscience is primarily concerned with building more accurate models
of how the brain actually works.
In the 1980s, the second wave of neural network research emerged in great part
via a movement called connectionism or parallel distributed processing (Rumelhart
et al., 1986d). Connectionism arose in the context of cognitive science. Cognitive
science is an interdisciplinary approach to understanding the mind, combining
multiple different levels of analysis. During the early 1980s, most cognitive sci-
entists studied models of symbolic reasoning. Despite their popularity, symbolic
models were difficult to explain in terms of how the brain could actually imple-
ment them using neurons. The connectionists began to study models of cognition
that could actually be grounded in neural implementations, reviving many ideas
dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).
The central idea in connectionism is that a large number of simple compu-
tational units can achieve intelligent behavior when networked together. This
insight applies equally to neurons in biological nervous systems and to hidden
units in computational models.
Several key concepts arose during the connectionism movement of the 1980s
that remain central to today’s deep learning.
One of these concepts is that of distributed representation. This is the idea that
each input to a system should be represented by many features, and each feature
should be involved in the representation of many possible inputs. For example,
suppose we have a vision system that can recognize cars, trucks, and birds and
these objects can each be red, green, or blue. One way of representing these inputs
would be to have a separate neuron or hidden unit that activates for each of the
nine possible combinations: red truck, red car, red bird, green truck, and so on.
This requires nine different neurons, and each neuron must independently learn
the concept of color and object identity. One way to improve on this situation
is to use a distributed representation, with three neurons describing the color
and three neurons describing the object identity. This requires only six neurons
total instead of nine, and the neuron describing redness is able to learn about
redness from images of cars, trucks, and birds, not only from images of one specific
category of objects. The concept of distributed representation is central to this
book, and will be described in greater detail in Chapter 17.
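A small sketch of this contrast, assuming one-hot codes for illustration: the local representation needs one unit per (color, object) pair, while the distributed representation reuses three color units and three object units across all nine combinations.

```python
# Local (one unit per combination) vs. distributed representation.
import numpy as np

colors = ["red", "green", "blue"]
objects = ["car", "truck", "bird"]

def one_hot(item, vocab):
    v = np.zeros(len(vocab))
    v[vocab.index(item)] = 1.0
    return v

# Local: one unit for each of the nine (color, object) combinations
pairs = [(c, o) for c in colors for o in objects]
local = one_hot(("red", "truck"), pairs)        # 9-dimensional

# Distributed: 3 color units + 3 object units; the "red" unit is shared
# by red cars, red trucks, and red birds
distributed = np.concatenate([one_hot("red", colors),
                              one_hot("truck", objects)])   # 6-dimensional
print(len(local), len(distributed))             # 9 6
```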
Another major accomplishment of the connectionist movement was the suc-
cessful use of back-propagation to train deep neural networks with internal repre-
sentations and the popularization of the back-propagation algorithm (Rumelhart
et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity,
but as of this writing it is the dominant approach to training deep models.
The second wave of neural networks research lasted until the mid-1990s. At
that point, the popularity of neural networks declined again. This was in part due
to a negative reaction to the failure of neural networks (and AI research in general)
to fulfill excessive promises made by a variety of people seeking investment in
neural network-based ventures, but also due to improvements in other fields of
machine learning: kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995;
Schölkopf et al., 1999) and graphical models (Jordan, 1998).
Kernel machines enjoy many nice theoretical guarantees. In particular, train-
ing a kernel machine is a convex optimization problem (this will be explained in
more detail in Chapter 4) which means that the training process can be guar-
anteed to find the optimal model efficiently. This made kernel machines very
amenable to software implementations that “just work” without much need for
the human operator to understand the underlying ideas. Soon, most machine
learning applications consisted of manually designing good features to provide to
a kernel machine for each different application area.
During this time, neural networks continued to obtain impressive performance
on some tasks (LeCun et al., 1998b; Bengio et al., 2001a). The Canadian Institute
for Advanced Research (CIFAR) helped to keep neural networks research alive
via its Neural Computation and Adaptive Perception research initiative. This
program united machine learning research groups led by Geoffrey Hinton at University of
Toronto, Yoshua Bengio at University of Montreal, and Yann LeCun at New York
University. It had a multi-disciplinary nature that also included neuroscientists
and experts in human and computer vision.
At this point in time, deep networks were generally believed to be very difficult
to train. We now know that algorithms that have existed since the 1980s work
quite well, but this was not apparent circa 2006. The issue is perhaps simply that
these algorithms were too computationally costly to allow much experimentation
with the hardware available at the time.
The third wave of neural networks research began with a breakthrough in
2006. Geoffrey Hinton showed that a kind of neural network called a deep be-
lief network could be efficiently trained using a strategy called greedy layer-wise
pretraining (Hinton et al., 2006), which will be described in more detail in Chap-
ter 17.1. The other CIFAR-affiliated research groups quickly showed that the
same strategy could be used to train many other kinds of deep networks (Bengio
et al., 2007; Ranzato et al., 2007) and systematically helped to improve gener-
alization on test examples. This wave of neural networks research popularized
the use of the term deep learning to emphasize that researchers were now able to
train deeper neural networks than had been possible before, and to emphasize the
theoretical importance of depth (Bengio and LeCun, 2007a; Delalleau and Ben-
gio, 2011; Pascanu et al., 2014a; Montufar et al., 2014). Deep neural networks
displaced kernel machines with manually designed features for several important
application areas during this time—in part because the time and memory cost
of training a kernel machine is quadratic in the size of the dataset, and datasets
grew to be large enough for this cost to outweigh the benefits of convex optimiza-
tion. This third wave of popularity of neural networks continues to the time of
this writing, though the focus of deep learning research has changed dramatically
within the time of this wave. The third wave began with a focus on new unsuper-
vised learning techniques and the ability of deep models to generalize well from
small datasets, but today there is more interest in much older supervised learning
algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2 Increasing Dataset Sizes
One may wonder why deep learning has only recently become recognized as a
crucial technology if it has existed since the 1950s. Deep learning has been suc-
cessfully used in commercial applications since the 1990s, but was often regarded
as being more of an art than a technology and something that only an expert could
use, until recently. It is true that some skill is required to get good performance
from a deep learning algorithm. Fortunately, the amount of skill required re-
duces as the amount of training data increases. The learning algorithms reaching
human performance on complex tasks today are nearly identical to the learning
algorithms that struggled to solve toy problems in the 1980s, though the models
we train with these algorithms have undergone changes that simplify the train-
ing of very deep architectures. The most important new development is that
today we can provide these algorithms with the resources they need to succeed.
Fig. 1.7 shows how the size of benchmark datasets has increased remarkably over
time. This trend is driven by the increasing digitization of society. As more and
more of our activities take place on computers, more and more of what we do
is recorded. As our computers are increasingly networked together, it becomes
easier to centralize these records and curate them into a dataset appropriate for
machine learning applications. The age of “Big Data” has made machine learning
much easier because the key burden of statistical estimation—generalizing well
to new data after observing only a small amount of data—has been considerably
lightened. As of 2015, a rough rule of thumb is that a supervised deep learning
algorithm will generally achieve acceptable performance with around 5,000 la-
beled examples per category, and will match or exceed human performance when
trained with a dataset containing at least 10 million labeled examples. Working
successfully with datasets smaller than this is an important research area, focus-
ing in particular on how we can take advantage of large quantities of unlabeled
examples, with unsupervised or semi-supervised learning.
1.2.3 Increasing Model Sizes
Another key reason that neural networks are wildly successful today after enjoy-
ing comparatively little success since the 1980s is that we have the computational
resources to run much larger models today. One of the main insights of con-
nectionism is that animals become intelligent when many of their neurons work
together. An individual neuron or small collection of neurons is not particularly
useful.
Biological neurons are not especially densely connected. As seen in Fig. 1.8,
our machine learning models have for decades had a number of connections per
neuron within an order of magnitude of that found even in mammalian brains.
In terms of the total number of neurons, neural networks have been aston-
ishingly small until quite recently, as shown in Fig. 1.9. Since the introduction
of hidden units, artificial neural networks have doubled in size roughly every 2.4
years. This growth is driven by faster computers with larger memory and by the
availability of larger datasets. Larger networks are able to achieve higher accuracy
on more complex tasks. This trend looks set to continue for decades. Unless new
technologies allow faster scaling, artificial neural networks will not have the same
number of neurons as the human brain until at least the 2050s. Biological neu-
rons may represent more complicated functions than current artificial neurons, so
biological neural networks may be even larger than this plot portrays.
In retrospect, it is not particularly surprising that neural networks with fewer
neurons than a leech were unable to solve sophisticated artificial intelligence prob-
lems. Even today’s networks, which we consider quite large from a computational
systems point of view, are smaller than the nervous system of even relatively prim-
itive vertebrate animals like frogs.
The increase in model size over time, due to the availability of faster CPUs, the
advent of general purpose GPUs, faster network connectivity, and better software
infrastructure for distributed computing, is one of the most important trends in
the history of deep learning. This trend is generally expected to continue well
into the future.
[Figure 1.7 plot: dataset size (number of examples, logarithmic scale, roughly 10^0–10^9) versus year, 1900–2015, titled “Increasing dataset size over time.” Datasets shown include Iris, the criminals dataset, Rotated T vs. C, T vs. G vs. F, MNIST, CIFAR-10, public SVHN, ImageNet, ImageNet10k, ILSVRC 2014, Sports-1M, the Canadian Hansard, and WMT English–French.]
Figure 1.7: Dataset sizes have increased greatly over time. In the early 1900s, statisticians
studied datasets using hundreds or thousands of manually compiled measurements (Gar-
son, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s,
the pioneers of biologically-inspired machine learning often worked with small, synthetic
datasets, such as low-resolution bitmaps of letters, that were designed to incur low com-
putational cost and demonstrate that neural networks were able to learn specific kinds
of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s,
machine learning became more statistical in nature and began to leverage larger datasets
containing tens of thousands of examples such as the MNIST dataset of scans of handwrit-
ten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated
datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009)
continued to be produced. Toward the end of that decade and throughout the first half
of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of
millions of examples, completely changed what was possible with deep learning. These
datasets included the public Street View House Numbers dataset (Netzer et al., 2011),
various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al.,
2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph,
we see that datasets of translated sentences, such as IBM’s dataset constructed from the
Canadian Hansard (Brown et al., 1990) and the WMT 2014 dataset (Schwenk, 2014) are
typically far ahead of other dataset sizes.
[Figure 1.8 plot: connections per neuron (logarithmic scale) versus year, 1950–2015,
titled “Number of connections per neuron over time.” Numbered points 1–10 correspond
to the networks listed below; reference lines mark the fruit fly, mouse, cat, and human.]
Figure 1.8: Initially, the number of connections between neurons in artificial neural net-
works was limited by hardware capabilities. Today, the number of connections between
neurons is mostly a design consideration. Some artificial neural networks have nearly as
many connections per neuron as a cat, and it is quite common for other neural networks
to have as many connections per neuron as smaller mammals like mice. Even the human
brain does not have an exorbitant number of connections per neuron. The sparse
connectivity of biological neural networks means that our artificial networks are able to match
the performance of biological neural networks despite limited hardware. Modern neural
networks are much smaller than the brains of any vertebrate animal, but we typically
train each network to perform just one task, while an animal’s brain has different areas
devoted to different tasks. Biological neural network sizes from Wikipedia (2015).
1. Adaptive Linear Element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009b)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014)
1.2.4 Increasing Accuracy, Application Complexity, and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide
accurate recognition or prediction. Moreover, deep learning has consistently been
applied with success to broader and broader sets of applications.
The earliest deep models were used to recognize individual objects in tightly
cropped, extremely small images (Rumelhart et al., 1986a). Since then there
has been a gradual increase in the size of images neural networks could process.
Modern object recognition networks process rich high-resolution photographs and
do not require that the photograph be cropped near the object to be recognized
(Krizhevsky et al., 2012b). Similarly, the earliest networks could only
recognize two kinds of objects (or in some cases, the absence or presence of a sin-
gle kind of object), while these modern networks typically recognize at least 1,000
different categories of objects. The largest contest in object recognition is the Im-
ageNet Large-Scale Visual Recognition Competition held each year. A dramatic
moment in the meteoric rise of deep learning came when a convolutional network
won this challenge for the first time and by a wide margin, bringing down the
state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012b). Since
then, the competition has consistently been won by deep convolutional nets, and as
of this writing, advances in deep learning have brought the latest error rate in this
contest down to 6.5%, as shown in Fig. 1.10, using even deeper networks (Szegedy
et al., 2014). Outside the framework of the contest, this error rate has now
dropped to 4.58% (Wu et al., 2015).
Deep learning has also had a dramatic impact on speech recognition. After
improving throughout the 1990s, the error rates for speech recognition stagnated
starting in about 2000. The introduction of deep learning (Dahl et al., 2010;
Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition
resulted in a sudden drop in error rates, in some cases cutting them roughly in half.
We will explore this history in more detail in Section 13.2.1.
Deep networks have also had spectacular successes for pedestrian detection
and image segmentation (Sermanet et al., 2013; Farabet et al., 2013a; Cou-
prie et al., 2013) and yielded superhuman performance in traffic sign classifica-
tion (Ciresan et al., 2012).
At the same time that the scale and accuracy of deep networks has increased,
so has the complexity of the tasks that they can solve. Goodfellow et al. (2014)
showed that neural networks could learn to output an entire sequence of characters
transcribed from an image, rather than just identifying a single object. Previously,
it was widely believed that this kind of learning required labeling of the individual
elements of the sequence (Gülçehre and Bengio, 2013). Since then, a neural
network designed to model sequences, the Long Short-Term Memory or LSTM
(Hochreiter and Schmidhuber, 1997), has enjoyed an explosion in popularity.
LSTMs and related models are now used to model relationships between sequences
and other sequences rather than just fixed inputs. This sequence-to-sequence
learning seems to be on the cusp of revolutionizing another application: machine
translation (Sutskever et al., 2014a; Bahdanau et al., 2014).
This trend of increasing complexity has been pushed to its logical conclusion
with the introduction of the Neural Turing Machine (Graves et al., 2014), a neural
network that can learn entire programs. This neural network has been shown to
be able to learn how to sort lists of numbers given examples of scrambled and
sorted sequences. This self-programming technology is in its infancy, but in the
future could in principle be applied to nearly any task.
Many of these applications of deep learning are highly profitable, provided that
enough data is available. Deep learning is now used by many top technology
companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe,
Netflix, NVIDIA and NEC.
Deep learning has also made contributions back to other sciences. Modern
convolutional networks for object recognition provide a model of visual processing
that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful
tools for processing massive amounts of data and making useful predictions in
scientific fields. It has been successfully used to predict how molecules will interact
in order to help pharmaceutical companies design new drugs (Dahl et al., 2014),
to search for subatomic particles (Baldi et al., 2014), and to automatically parse
microscope images used to construct a 3-D map of the human brain (Knowles-
Barley et al., 2014). We expect deep learning to appear in more and more scientific
fields in the future.
In summary, deep learning is an approach to machine learning that has drawn
heavily on our knowledge of the human brain, statistics and applied math as it
developed over the past several decades. In recent years, it has seen tremendous
growth in its popularity and usefulness, due in large part to more powerful com-
puters, larger datasets and techniques to train deeper networks. The years ahead
are full of challenges and opportunities to improve deep learning even further and
bring it to new frontiers.
[Figure 1.9 plot: number of neurons (logarithmic scale, roughly 10^-2 to 10^11) versus
year, 1950–2056, titled “Increasing neural network size over time.” Numbered points
1–20 correspond to the networks listed below; reference lines mark the sponge,
roundworm, leech, ant, bee, frog, octopus, and human.]
Figure 1.9: Since the introduction of hidden units, artificial neural networks have doubled
in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive Linear Element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early backpropagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998a)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machines (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009b)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012a)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014)
Figure 1.10: Since deep networks reached the scale necessary to compete in the ImageNet
Large Scale Visual Recognition Challenge, they have won the competition every year,
yielding lower and lower error rates each time. Data from Russakovsky et al. (2014b).