
[Figure 1.1 diagram labels: Visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); Output (object identity): CAR, PERSON, ANIMAL]
Figure 1.1: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer. Then a series of hidden layers extracts increasingly abstract features from the image. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images provided by Zeiler and Fergus (2014).
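
To make the idea of nested simple mappings concrete, the sketch below (not from the book; the layer sizes, the three-class output, and the random weights are illustrative assumptions) composes a few affine-plus-ReLU layers so that raw pixel values are transformed step by step into a score for each object identity. A trained network would learn these weights from data, and a convolutional architecture would be used for real images; the point here is only the composition of simple per-layer mappings.

```python
# Minimal sketch of a deep model as a composition of simple mappings:
# pixels -> edges -> contours -> parts -> object identity.
# Layer sizes, class count, and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One simple mapping: an affine transform followed by a ReLU nonlinearity."""
    return np.maximum(0.0, w @ x + b)

# Hypothetical dimensions: a 32x32 grayscale image flattened to 1024 pixels,
# three hidden layers, and 3 output classes (e.g. CAR, PERSON, ANIMAL).
dims = [1024, 256, 128, 64, 3]
weights = [rng.normal(0.0, 0.01, (dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

x = rng.random(dims[0])                 # visible layer: raw pixel values
h = x
for w, b in zip(weights[:-1], biases[:-1]):
    h = layer(h, w, b)                  # hidden layers: increasingly abstract features

logits = weights[-1] @ h + biases[-1]   # output layer: one score per object identity
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax over the object identities
print(probs)
```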