
[Figure 1.2 image: a column of feature visualizations reproduced from Zeiler and Fergus (2014). Panel labels: Visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); Output (object identity): CAR, PERSON, ANIMAL.]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
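To make the caption’s idea of nested simple mappings concrete, here is a minimal NumPy sketch of a first “edge” layer that compares the brightness of neighboring pixels, followed by a simple nonlinearity. The conv2d and relu helpers, the hand-written Sobel-style kernel, and the toy image are illustrative assumptions for this sketch only; they are not the learned filters visualized in the figure.

# A minimal sketch (not the book's or Zeiler & Fergus's code) of "nested simple
# mappings": an edge-detecting first layer built by comparing neighboring pixel
# brightness, stacked as a composition of simple functions.
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Elementwise nonlinearity applied between layers."""
    return np.maximum(x, 0.0)

# First "hidden layer": a hand-written filter that responds to vertical edges
# by subtracting each pixel's left neighbors from its right neighbors.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])   # Sobel-style, purely illustrative

# Toy 6x6 image: dark left half, bright right half -> one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# The deep-learning idea in miniature: the overall mapping is a composition of
# simple per-layer mappings, here f(x) = relu(conv2d(x, k)).
edges = relu(conv2d(image, edge_kernel))
print(edges)   # Large responses only along the column where brightness changes.

In a trained convolutional network such filters are learned from data rather than hand-written, and each layer applies many of them in parallel; the sketch only mirrors the composition structure the caption describes, where each layer’s output becomes the next layer’s input.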