
CHAPTER 1. INTRODUCTION
[Figure 1.2, panel labels: Visible layer (input pixels) → 1st hidden layer (edges) → 2nd hidden layer (corners and contours) → 3rd hidden layer (object parts) → Output (object identity): CAR, PERSON, ANIMAL]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
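To make the idea of "nested simple mappings" concrete, the following is a minimal, hypothetical sketch (not the book's or Zeiler and Fergus's actual model): each stage is a toy stand-in for a layer in Figure 1.2, and the first stage detects edges simply by comparing the brightness of neighboring pixels. All function names, the pooling sizes, and the final threshold are illustrative assumptions, not learned parameters of a real network.

```python
# Illustrative sketch of Figure 1.2: a deep model as a composition of simple
# layer functions. All names, filters, and thresholds are hypothetical.
import numpy as np

def edge_layer(pixels):
    """First 'hidden layer': respond to edges by differencing neighboring pixels."""
    horizontal = np.abs(np.diff(pixels, axis=1))  # brightness change left-to-right
    vertical = np.abs(np.diff(pixels, axis=0))    # brightness change top-to-bottom
    # Crop both maps to a common shape and stack them as two feature maps.
    return np.stack([horizontal[:-1, :], vertical[:, :-1]])

def corner_layer(edge_maps):
    """Second 'hidden layer': corners appear where horizontal and vertical edges co-occur."""
    return edge_maps[0] * edge_maps[1]

def parts_layer(corner_map):
    """Third 'hidden layer': pool corner evidence over 4x4 regions into crude 'part' scores."""
    h, w = corner_map.shape
    cropped = corner_map[: h - h % 4, : w - w % 4]
    return cropped.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def output_layer(part_scores):
    """Output: a toy 'object identity' decision from the pooled part evidence."""
    return "object-like" if part_scores.max() > 0.5 else "background"

def deep_model(pixels):
    # The complicated mapping from pixels to identity, written as nested simple mappings.
    return output_layer(parts_layer(corner_layer(edge_layer(pixels))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))  # stand-in for the visible layer (input pixels)
    print(deep_model(image))
```

The point of the sketch is only the structure: the hard pixels-to-identity mapping is never written down directly, but emerges from composing stages that are each easy to describe, which is the decomposition the caption attributes to depth. In a real deep network these stages are learned convolutional layers rather than hand-written rules.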