Chapter 9
Convolutional Networks
Convolutional networks (also known as convolutional neural networks or CNNs)
are a specialized kind of neural network for processing data that has a known,
grid-like topology. Examples include time-series data, which can be thought of
as a 1D grid taking samples at regular time intervals, and image data, which
can be thought of as a 2D grid of pixels. Convolutional networks have been
tremendously successful in practical applications; see section 13.1.2 for details.
The name “convolutional neural network” indicates that the network employs
a mathematical operation called convolution. Convolution is a specialized kind
of linear operation. Convolutional networks are simply neural networks
that use convolution in place of general matrix multiplication.
In this chapter, we will first describe what convolution is. Next, we will
explain the motivation behind using convolution in a neural network. We will
then describe an operation called pooling, which almost all convolutional networks
employ. Usually, the operation used in a convolutional neural network does not
correspond precisely to the definition of convolution as used in other fields such
as engineering or pure mathematics. We will describe several variants on the
convolution function that are widely used in practice for neural networks. We
will also show how convolution may be applied to many kinds of data, with
different numbers of dimensions. We then discuss means of making convolution
more efficient. We conclude with comments about the role convolutional networks
have played in the history of deep learning.
9.1 The Convolution Operation
In its most general form, convolution is an operation on two functions of a real-
valued argument. To motivate the definition of convolution, let’s start with ex-
amples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time t. Both
x and t are real-valued, i.e., we can get a different reading from the laser sensor
at any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship’s position, we would like to average together several
measurements. Of course, more recent measurements are more relevant, so we
will want this to be a weighted average that gives more weight to recent measure-
ments. We can do this with a weighting function w(a), where a is the age of a
measurement. If we apply such a weighted average operation at every moment,
we obtain a new function s providing a smoothed estimate of the position of the
spaceship:
    s(t) = \int x(a)\, w(t - a)\, da.
This operation is called convolution. The convolution operation is typically
denoted with an asterisk:
    s(t) = (x * w)(t).
In our example, w needs to be a valid probability density function, or the
output is not a weighted average. Also, w needs to be 0 for all negative argu-
ments, or it will look into the future, which is presumably beyond our capabilities.
These limitations are particular to our example though. In general, convolution
is defined for any functions for which the above integral is defined, and may be
used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example,
the function x) to the convolution is often referred to as the input and the second
argument (in this example, the function w) as the kernel. The output is sometimes
referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at
every instant in time is not realistic. Usually, when we work with data on a
computer, time will be discretized, and our sensor will provide data at regular
intervals. In our example, it might be more realistic to assume that our laser
provides one measurement per second. The time index t can then take on only integer
values. If we now assume that x and w are defined only on integer t, we can
define the discrete convolution:
    s[t] = (x * w)[t] = \sum_{a=-\infty}^{\infty} x[a]\, w[t - a].
In machine learning applications, the input is usually an array of data and
the kernel is usually an array of learnable parameters. Because each element of
the input and kernel must be explicitly stored separately, we usually assume that
these functions are zero everywhere but the finite set of points for which we store
the values. This means that in practice we can implement the infinite summation
as a summation over a finite number of array elements.
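To make the finite summation concrete, here is a minimal NumPy sketch (ours, not part of the original text; the sensor readings and weights are made up) that implements the discrete convolution above, treating the input and kernel as zero outside the stored entries:

    import numpy as np

    def conv1d(x, w):
        # Discrete convolution s[t] = sum_a x[a] w[t - a] for finite arrays.
        # x and w are assumed to be zero outside the stored entries, so the
        # infinite sum reduces to a finite one over valid indices.
        n, k = len(x), len(w)
        s = np.zeros(n + k - 1)          # length of the "full" output
        for t in range(len(s)):
            for a in range(n):
                if 0 <= t - a < k:       # only terms where w[t - a] is stored
                    s[t] += x[a] * w[t - a]
        return s

    x = np.array([0., 1., 2., 3.])       # hypothetical sensor readings
    w = np.array([0.6, 0.3, 0.1])        # weights favoring recent measurements
    print(conv1d(x, w))                  # agrees with np.convolve(x, w)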
Finally, we often use convolutions over more than one axis at a time. For
example, if we use a two-dimensional image I as our input, we probably also
want to use a two-dimensional kernel K:
    s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[m, n]\, K[i - m, j - n].
Note that convolution is commutative, meaning we can equivalently write:
    s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[i - m, j - n]\, K[m, n].
Usually the latter view is more straightforward to implement in a machine
learning library, because there is less variation in the range of valid values of m
and n.
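As a sketch of how the second form translates into code (our own illustration; the function names are hypothetical), the loops below compute a valid 2-D convolution with kernel flipping, followed by the cross-correlation variant that most deep learning libraries implement under the name "convolution":

    import numpy as np

    def conv2d_flipped(I, K):
        # s[i, j] = sum_{m,n} I[i - m, j - n] K[m, n], restricted to the "valid"
        # region where every index into I exists. Looping over the kernel indices
        # m, n keeps their range fixed, which is why this form is easier to implement.
        H, W = I.shape
        kH, kW = K.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                for m in range(kH):
                    for n in range(kW):
                        out[i, j] += I[i + kH - 1 - m, j + kW - 1 - n] * K[m, n]
        return out

    def cross_correlate2d(I, K):
        # The same loops without flipping the kernel; this is what most deep
        # learning libraries call "convolution".
        H, W = I.shape
        kH, kW = K.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K)
        return out

    I = np.random.rand(6, 6)
    K = np.random.rand(3, 3)
    # Cross-correlating with the flipped kernel gives the same result as convolution.
    print(np.allclose(conv2d_flipped(I, K), cross_correlate2d(I, K[::-1, ::-1])))   # True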
Discrete convolution can be viewed as multiplication by a matrix. However,
the matrix has several entries constrained to be equal to other entries. For exam-
ple, for univariate discrete convolution, each row of the matrix is constrained to be
equal to the row above shifted by one element. This is known as a Toeplitz matrix.
In two dimensions, a doubly block circulant matrix corresponds to convolution. In
addition to these constraints that several elements be equal to each other, con-
volution usually corresponds to a very sparse matrix (a matrix whose entries are
mostly equal to zero). This is because the kernel is usually much smaller than
the input image. Viewing convolution as matrix multiplication usually does not
help to implement convolution operations, but it is useful for understanding and
designing neural networks. Any neural network algorithm that works with matrix
multiplication and does not depend on specific properties of the matrix structure
should work with convolution, without requiring any further changes to the neural
network. Typical convolutional neural networks do make use of further special-
izations in order to deal with large inputs efficiently, but these are not strictly
necessary from a theoretical perspective.
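As a small illustration of this matrix view (ours), the matrix for a 1-D valid convolution can be built explicitly and checked against the direct computation; each row is the previous row shifted by one element, and most entries are zero:

    import numpy as np

    def conv_matrix(w, n):
        # Matrix C such that C @ x equals the valid 1-D convolution of x with w.
        # The rows are shifted copies of each other (a Toeplitz structure), and
        # most entries are zero because the kernel is much smaller than the input.
        k = len(w)
        C = np.zeros((n - k + 1, n))
        for i in range(n - k + 1):
            C[i, i:i + k] = w[::-1]      # reversed kernel: convolution, not cross-correlation
        return C

    x = np.arange(6.0)
    w = np.array([1.0, -2.0, 0.5])
    C = conv_matrix(w, len(x))
    print(C @ x)
    print(np.convolve(x, w, mode="valid"))   # same result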
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine
learning system: sparse interactions, parameter sharing, and equivariant repre-
Figure 9.1: Sparse connectivity, viewed from below: We highlight one input unit, x_3, and
also highlight the output units in s that are affected by this unit. (Left) When s is
formed by convolution with a kernel of width 3, only three outputs are affected by x_3.
(Right) When s is formed by matrix multiplication, connectivity is no longer sparse, so
all of the outputs are affected by x_3.
sentations. Moreover, convolution provides a means for working with inputs of
variable size. We now describe each of these ideas in turn.
Traditional neural network layers use a matrix multiplication to describe the
interaction between each input unit and each output unit. This means every
output unit interacts with every input unit. Convolutional networks, however,
typically have sparse interactions (also referred to as sparse connectivity or sparse
weights). This is accomplished by making the kernel smaller than the input. For
example, when processing an image, the input image might have thousands or
millions of pixels, but we can detect small, meaningful features such as edges
with kernels that occupy only tens or hundreds of pixels. This means that we
need to store fewer parameters, which both reduces the memory requirements of
the model and improves its statistical efficiency. It also means that computing the
output requires fewer operations. These improvements in efficiency are usually
quite large. If there are m inputs and n outputs, then matrix multiplication
requires m × n parameters and the algorithms used in practice have O(m × n)
runtime (per example). If we limit the number of connections each output may
have to k, then the sparsely connected approach requires only k × n parameters
and O(k × n) runtime. For many practical applications, it is possible to obtain
good performance on the machine learning task while keeping k several orders of
magnitude smaller than m. For graphical demonstrations of sparse connectivity,
see Fig. 9.1 and Fig. 9.2. In a deep convolutional network, units in the deeper
layers may indirectly interact with a larger portion of the input, as shown in
Fig. 9.3. This allows the network to efficiently describe complicated interactions
between many variables by constructing such interactions from simple building
blocks that each describe only sparse interactions.
Parameter sharing refers to using the same parameter for more than one func-
tion in a model. In a traditional neural net, each element of the weight matrix
is used exactly once when computing the output of a layer. It is multiplied by
one element of the input, and then never revisited. As a synonym for parameter
Figure 9.2: Sparse connectivity, viewed from above: We highlight one output unit, s_3,
and also highlight the input units in x that affect this unit. These units are known as the
receptive field of s_3. (Left) When s is formed by convolution with a kernel of width 3, only
three inputs affect s_3. (Right) When s is formed by matrix multiplication, connectivity
is no longer sparse, so all of the inputs affect s_3.
sharing, one can say that a network has tied weights, because the value of the
weight applied to one input is tied to the value of a weight applied elsewhere. In
a convolutional neural net, each member of the kernel is used at every position of
the input (except perhaps some of the boundary pixels, depending on the design
decisions regarding the boundary). The parameter sharing used by the convolu-
tion operation means that rather than learning a separate set of parameters for
every location, we learn only one set. This does not affect the runtime of forward
propagation, which is still O(k × n), but it does further reduce the storage requirements
of the model to k parameters. Recall that k is usually several orders of magnitude
smaller than m. Since m and n are usually roughly the same size, k is practically
insignificant compared to m × n. Convolution is thus dramatically more efficient
than dense matrix multiplication in terms of the memory requirements and sta-
tistical efficiency. For a graphical depiction of how parameter sharing works, see
Fig. 9.4.
As an example of both of these first two principles in action, Fig. 9.5 shows
how sparse connectivity and parameter sharing can dramatically improve the
efficiency of a linear function for detecting edges in an image.
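The arithmetic behind Fig. 9.5 can be reproduced directly; the sketch below is ours, with a random array standing in for the photograph:

    import numpy as np

    H, W = 280, 320
    image = np.random.rand(H, W)              # stand-in for the 280 x 320 input photo

    # Vertical edge detector: each output pixel is a pixel minus its left neighbor,
    # i.e. a convolution with a 2-element kernel along each row.
    edges = image[:, 1:] - image[:, :-1]      # output is 280 x 319

    conv_flops = 319 * 280 * 3                # 267,960 operations, as in the caption of Fig. 9.5
    dense_entries = 320 * 280 * 319 * 280     # over 8 billion matrix entries
    print(dense_entries // 2)                 # roughly 4 billion times more entries than the 2-element kernel
    print((2 * dense_entries) // conv_flops)  # roughly 60,000 times more floating point operations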
In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation. To say a function is
equivariant means that if the input changes, the output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In
the case of convolution, if we let g be any function that translates the input, i.e.,
shifts it, then the convolution function is equivariant to g. For example, define
g(x) such that for all i, g(x)[i] = x[i − 1]. This shifts every element of x one
Figure 9.3: The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers. This effect increases
if the network includes architectural features like strided convolution or pooling. This
means that even though direct connections in a convolutional net are very sparse, units
in the deeper layers can be indirectly connected to all or most of the input image.
Figure 9.4: Parameter sharing: We highlight the connections that use a particular pa-
rameter in two different models. (Left) We highlight uses of the central element of a
3-element kernel in a convolutional model. Due to parameter sharing, this single param-
eter is used at all input locations. (Right) We highlight the use of the central element of
the weight matrix in a fully connected model. This model has no parameter sharing so
the parameter is used only once.
unit to the right. If we apply this transformation to x, then apply convolution,
the result will be the same as if we applied convolution to x, then applied the
transformation to the output. When processing time series data, this means that
convolution produces a sort of timeline that shows when different features appear
in the input. If we move an event later in time in the input, the exact same
representation of it will appear in the output, just later in time. Similarly with
images, convolution creates a 2-D map of where certain features appear in the
input. If we move the object in the input, its representation will move the same
amount in the output. This is useful when we know that the same local function
is useful everywhere in the input. For example, when processing images, it is
useful to detect edges in the first layer of a convolutional network, and an edge
looks the same regardless of where it appears in the image. This property is not
always useful. For example, if we want to recognize a face, some portion of the
network needs to vary with spatial location, because the top of a face does not
look the same as the bottom of a face: the part of the network processing the top
of the face needs to look for eyebrows, while the part of the network processing
the bottom of the face needs to look for a chin.
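A quick numerical check of this translation equivariance (our own sketch; a shift by one position with zero fill is just one convenient choice of the translation g):

    import numpy as np

    x = np.array([0., 0., 1., 0., 0., 0.])
    w = np.array([0.5, 0.25])

    def shift_right(v):
        # g(x)[i] = x[i - 1]: shift one step to the right, filling with zero.
        out = np.zeros_like(v)
        out[1:] = v[:-1]
        return out

    a = np.convolve(shift_right(x), w)[:len(x)]   # translate, then convolve
    b = shift_right(np.convolve(x, w)[:len(x)])   # convolve, then translate
    print(np.allclose(a, b))                      # True: equivariance to translation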
Note that convolution is not equivariant to some other transformations, such
as changes in the scale or rotation of an image. Other mechanisms are necessary
for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by
matrix multiplication with a fixed-shape matrix. Convolution enables processing
of some of these kinds of data. We discuss this further in section 9.6.
Figure 9.5: Efficiency of edge detection. The image on the right was formed by taking
each pixel in the original image and subtracting the value of its neighboring pixel on
the left. This shows the strength of all of the vertically oriented edges in the input
image, which can be a useful operation for object detection. Both images are 280 pixels
tall. The input image is 320 pixels wide while the output image is 319 pixels wide.
This transformation can be described by a convolution kernel containing 2 elements, and
requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one
addition per output pixel) to compute. To describe the same transformation with a matrix
multiplication would take 320 × 280 × 319 × 280, or over 8 billion, entries in the matrix,
making convolution 4 billion times more efficient for representing this transformation. The
straightforward matrix multiplication algorithm performs over 16 billion floating point
operations, making convolution roughly 60,000 times more efficient computationally. Of
course, most of the entries of the matrix would be zero. If we stored only the nonzero
entries of the matrix, then both matrix multiplication and convolution would require the
same number of floating point operations to compute. The matrix would still need to
contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of
describing transformations that apply the same linear transformation of a small, local
region across the entire input. (Photo credit: Paula Goodfellow)
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.6).
In the first stage, the layer performs several convolutions in parallel to produce a
set of presynaptic activations. In the second stage, each presynaptic activation is
run through a nonlinear activation function, such as the rectified linear activation
function. This stage is sometimes called the detector stage. In the third stage,
we use a pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a
summary statistic of the nearby outputs. For example, the max pooling operation
reports the maximum output within a rectangular neighborhood. Other popular
pooling functions include the average of a rectangular neighborhood, the L2 norm
of a rectangular neighborhood, or a weighted average based on the distance from
the central pixel.
In all cases, pooling helps to make the representation become invariant to
small translations of the input. This means that if we translate the input by
a small amount, the values of most of the pooled outputs do not change. See
Fig. 9.7 for an example of how this works. Invariance to local translation
can be a very useful property if we care more about whether some
feature is present than exactly where it is. For example, when determining
whether an image contains a face, we need not know the location of the eyes with
pixel-perfect accuracy; we just need to know that there is an eye on the left side
of the face and an eye on the right side of the face. In other contexts, it is more
important to preserve the location of a feature. For example, if we want to find a
corner defined by two edges meeting at a specific orientation, we need to preserve
the location of the edges well enough to test whether they meet.
The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations. When this
assumption is correct, it can greatly improve the statistical efficiency of the net-
work.
Pooling over spatial regions produces invariance to translation, but if we pool
over the outputs of separately parametrized convolutions, the features can learn
which transformations to become invariant to (see Fig. 9.8).
Because pooling summarizes the responses over a whole neighborhood, it is
possible to use fewer pooling units than detector units, by reporting summary
statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See
Fig. 9.9 for an example. This improves the computational efficiency of the network
because the next layer has roughly k times fewer inputs to process. When the
number of parameters in the next layer is a function of its input size (such as
when the next layer is fully connected and based on matrix multiplication) this
reduction in the input size can also result in improved statistical efficiency and
Figure 9.6: The components of a typical convolutional neural network layer. There are two
commonly used sets of terminology for describing these layers. Left) In this terminology,
the convolutional net is viewed as a small number of relatively complex layers, with each
layer having many “stages.” In this terminology, there is a one-to-one mapping between
kernel tensors and network layers. In this book we generally use this terminology. Right)
In this terminology, the convolutional net is viewed as a larger number of simple layers;
every step of processing is regarded as a layer in its own right. This means that not every
“layer” has parameters.
Figure 9.7: Max pooling introduces invariance. Left: A view of the middle of the output
of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top
row shows the outputs of max pooling, with a stride of 1 between pooling regions and a
pooling region width of 3. Right: A view of the same network, after the input has been
shifted to the right by 1 pixel. Every value in the bottom row has changed, but only
half of the values in the top row have changed, because the max pooling units are only
sensitive to the maximum value in the neighborhood, not its exact location.
Figure 9.8: Example of learned invariances: If each of these filters drives units that appear
in the same max-pooling region, then the pooling unit will detect “5”s in any rotation.
By learning to have each filter be a different rotation of the “5” template, this pooling
unit has learned to be invariant to rotation. This is in contrast to translation invariance,
which is usually achieved by hard-coding the net to pool over shifted versions of a single
learned filter.
Figure 9.9: Pooling with downsampling. Here we use max-pooling with a pool width of
3 and a stride between pools of 2. This reduces the representation size by a factor of 2,
which reduces the computational and statistical burden on the next layer. Note that the
final pool has a smaller size, but must be included if we do not want to ignore some of
the detector units.
reduced memory requirements for storing the parameters.
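A minimal sketch of max pooling with downsampling (ours; the width-3, stride-2 setting and the detector values follow Fig. 9.9):

    import numpy as np

    def max_pool1d(x, width=3, stride=2):
        # Max pooling over 1-D detector outputs, one summary statistic per stride.
        # The final pool may cover fewer than `width` units so that no detector
        # unit is ignored, as in Fig. 9.9.
        out = []
        for start in range(0, len(x), stride):
            out.append(np.max(x[start:start + width]))
        return np.array(out)

    detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
    print(max_pool1d(detector))   # [1.0, 0.2, 0.1]: three pooled outputs for six detector units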
For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classifi-
cation layer must have a fixed size. This is usually accomplished by varying the
size of and offset between pooling regions so that the classification layer always
receives the same number of summary statistics regardless of the input size. For
example, the final pooling layer of the network may be defined to output four sets
of summary statistics, one for each quadrant of an image, regardless of the image
size.
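A sketch of such quadrant pooling (ours, not a specific library's API):

    import numpy as np

    def quadrant_max_pool(feature_map):
        # Pool a 2-D feature map into four summary statistics, one per quadrant.
        # The pooling regions scale with the input, so the classification layer
        # always receives exactly four numbers regardless of the image size.
        H, W = feature_map.shape
        h, w = H // 2, W // 2
        return np.array([feature_map[:h, :w].max(), feature_map[:h, w:].max(),
                         feature_map[h:, :w].max(), feature_map[h:, w:].max()])

    print(quadrant_max_pool(np.random.rand(96, 128)).shape)   # (4,)
    print(quadrant_max_pool(np.random.rand(50, 70)).shape)    # (4,) for a different input size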
9.4 Convolution and Pooling as an Infinitely Strong
Prior
Recall the concept of a prior probability distribution from section 5.3. This is a
probability distribution over the parameters of a model that encodes our beliefs
about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is. A weak prior is a prior distribution with high
entropy, such as a Gaussian distribution with high variance. Such a prior allows the
data to move the parameters more or less freely. A strong prior has very low
entropy, such as a Gaussian distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says
that these parameter values are completely forbidden, regardless of how much
support the data gives to those values.
We can imagine a convolutional net as being similar to a fully connected net,
but with an infinitely strong prior over its weights. This infinitely strong prior
says that the weights for one hidden unit must be identical to the weights of its
neighbor, but shifted in space. The prior also says that the weights must be zero,
except for in the small, spatially contiguous receptive field assigned to that hidden
unit. Overall, we can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer. This prior
says that the function the layer should learn contains only local interactions and
is equivariant to translation. Likewise, the use of pooling is an infinitely strong
prior that each unit should be invariant to small translations.
Of course, implementing a convolutional net as a fully connected net with an
infinitely strong prior would be extremely computationally wasteful. But thinking
of a convolutional net as a fully connected net with an infinitely strong prior can
give us some insights into how convolutional nets work.
This view of convolution and pooling as an infinitely strong prior gives a few
insights into how convolution and pooling work.
One key insight is that convolution and pooling can cause underfitting. Like
any prior, convolution and pooling are only useful when the assumptions made
by the prior are reasonably accurate. If a task relies on preserving precise
spatial information, then using pooling on all features can cause underfitting.
(Some convolutional network architectures (Szegedy et al., 2014) are designed to
use pooling on some channels but not on other channels, in order to get both
highly invariant features and features that will not underfit when the translation
invariance prior is incorrect.) When a task involves incorporating information from
very distant locations in the input, then the prior imposed by convolution may
be inappropriate.
Another key insight from this view is that we should only compare convolu-
tional models to other convolutional models in benchmarks of statistical learning
performance. Models that do not use convolution would be able to learn even
if we permuted all of the pixels in the image. For many image datasets, there
are separate benchmarks for models that are permutation invariant and must dis-
cover the concept of topology via learning, and models that have the knowledge
of spatial relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we usually do
not refer exactly to the standard discrete convolution operation as it is usually
understood in the mathematical literature. The functions used in practice differ
slightly. Here we describe these differences in detail, and highlight some useful
properties of the functions used in neural networks.
First, when we refer to convolution in the context of neural networks, we usu-
ally actually mean an operation that consists of many applications of convolution
in parallel. This is because convolution with a single kernel can only extract one
kind of feature, albeit at many spatial locations. Usually we want each layer of
our network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green,
and blue intensity at each pixel. In a multilayer convolutional network, the input
to the second layer is the output of the first layer, which usually has the output
of many different convolutions at each position. When working with images, we
usually think of the input and output of the convolution as being 3-D arrays, with
one index into the different channels and two indices into the spatial coordinates
of each channel. (Software implementations usually work in batch mode, so they
will actually use 4-D arrays, with the fourth axis indexing different examples in
the batch.)
Note that because convolutional networks usually use multi-channel convolu-
tion, the linear operations they are based on are not guaranteed to be commu-
tative, even if kernel-flipping is used. These multi-channel operations are only
commutative if each operation has the same number of output channels as input
channels.
Convolution is traditionally defined to index the kernel in the opposite or-
der as the input. This definition makes convolution commutative, which can be
convenient for writing proofs. In the context of neural networks, the commuta-
tive property of convolution is not especially helpful, so many implementations of
convolution index both the kernel and the input in the same order.
Assume we have a 4-D kernel array K with element K_{i,j,k,l} giving the connection
strength between a unit in channel i of the output and a unit in channel j of the
input, with an offset of k rows and l columns between the output unit and the
input unit. Assume our input consists of observed data V with element V_{i,j,k}
giving the value of the input unit within channel i at row j and column k. Assume
our output consists of Z with the same format as V. If Z is produced by convolving
K across V without flipping K, then

    Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n} \, K_{i,l,m,n},

where the summation over l, m, and n is over all values for which the array
indexing operations inside the summation are valid.
We may also want to skip over some positions of the kernel in order to reduce
the computational cost (at the expense of not extracting our features as finely).
We can think of this as downsampling the output of the full convolution function.
If we want to sample only every s pixels in each direction in the output, then we
can define a downsampled convolution function c such that

    Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j \times s + m,\, k \times s + n} \, K_{i,l,m,n} \right].    (9.1)

We refer to s as the stride of this downsampled convolution. It is also possible
to define a separate stride for each direction of motion.
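The following is a naive, loop-based sketch (ours) of c(K, V, s), written to mirror equation 9.1 with zero-based indexing; setting s = 1 recovers the unstrided operation:

    import numpy as np

    def strided_conv(K, V, s):
        # c(K, V, s): multi-channel "convolution" (no kernel flipping) with stride s.
        # K has shape (out_channels, in_channels, kernel_rows, kernel_cols).
        # V has shape (in_channels, rows, cols).
        # Returns Z of shape (out_channels, out_rows, out_cols).
        oc, ic, kr, kc = K.shape
        _, r, c = V.shape
        out_r = (r - kr) // s + 1
        out_c = (c - kc) // s + 1
        Z = np.zeros((oc, out_r, out_c))
        for i in range(oc):                     # output channel
            for j in range(out_r):              # output row
                for k in range(out_c):          # output column
                    patch = V[:, j * s:j * s + kr, k * s:k * s + kc]
                    Z[i, j, k] = np.sum(patch * K[i])
        return Z

    K = np.random.rand(2, 3, 3, 3)              # 2 output channels, 3 input channels, 3x3 kernels
    V = np.random.rand(3, 8, 8)
    print(strided_conv(K, V, s=2).shape)        # (2, 3, 3)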
One essential feature of any convolutional network implementation is the abil-
ity to implicitly zero-pad the input V in order to make it wider. Without this
feature, the width of the representation shrinks by the kernel width - 1 at each
layer. Zero padding the input allows us to control the kernel width and the size of
the output independently. Without zero padding, we are forced to choose between
shrinking the spatial extent of the network rapidly and using small kernels–both
scenarios that significantly limit the expressive power of the network. See Fig. 9.10
for an example.
Three special cases of the zero-padding setting are worth mentioning. One is
the extreme case in which no zero-padding is used whatsoever, and the convolution
kernel is only allowed to visit positions where the entire kernel is contained
within the image. In MATLAB terminology, this is called valid convolution. In
this case, all pixels in the output are a function of the same number of pixels in
the input, so the behavior of an output pixel is somewhat more regular. However,
the size of the output shrinks at each layer. If the input image is of size m × m and
the kernel is of size k × k, the output will be of size (m − k + 1) × (m − k + 1). The rate
of this shrinkage can be dramatic if the kernels used are large. Since the shrinkage
is greater than 0, it limits the number of convolutional layers that can be included
in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully
be considered convolutional. Another special case of the zero-padding setting
is when just enough zero-padding is added to keep the size of the output equal
to the size of the input. MATLAB calls this same convolution. In this case,
the network can contain as many convolutional layers as the available hardware
can support, since the operation of convolution does not modify the architectural
possibilities available to the next layer. However, the input pixels near the border
influence fewer output pixels than the input pixels near the center. This can
make the border pixels somewhat underrepresented in the model. This motivates
the other extreme case, which MATLAB refers to as full convolution, in which
enough zeroes are added for every pixel to be visited k times in each direction,
resulting in an output image of size (m + k − 1) × (m + k − 1). In this case, the output
pixels near the border are a function of fewer pixels than the output pixels near
Figure 9.10: The effect of zero padding on network size: Consider a convolutional network
with a kernel of width six at every layer. In this example, we do not use any pooling, so
only the convolution operation itself shrinks the network size. Top) In this convolutional
network, we do not use any implicit zero padding. This causes the representation to
shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only
able to have three convolutional layers, and the last layer does not ever move the kernel,
so arguably only two of the layers are truly convolutional. The rate of shrinking can
be mitigated by using smaller kernels, but smaller kernels are less expressive and some
shrinking is inevitable in this kind of architecture. Bottom) By adding five implicit zeroes
to each layer, we prevent the representation from shrinking with depth. This allows us
to make an arbitrarily deep convolutional network.
the center. This can make it difficult to learn a single kernel that performs well
at all positions in the convolutional feature map. Usually the optimal amount of
zero padding (in terms of test set classification accuracy) lies somewhere between
“valid” and “same” convolution.
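The three padding regimes can be sketched for the 1-D case (our own helper; "valid", "same", and "full" follow the MATLAB terminology used above, and the "same" rule shown assumes an odd kernel width):

    import numpy as np

    def padded_conv1d(x, w, mode="valid"):
        # 1-D convolution under the three zero-padding regimes described above.
        k = len(w)
        pad = {"valid": 0, "same": (k - 1) // 2, "full": k - 1}[mode]
        xp = np.pad(x, pad)                       # implicit zeros at both ends
        # valid convolution of the padded input with the flipped kernel
        return np.array([np.dot(xp[i:i + k], w[::-1]) for i in range(len(xp) - k + 1)])

    x = np.random.rand(16)
    w = np.random.rand(5)
    for mode in ("valid", "same", "full"):
        print(mode, padded_conv1d(x, w, mode).shape)   # (12,), (16,), (20,)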
In some cases, we do not actually want to use convolution, but rather locally
connected layers. In this case, the adjacency matrix in the graph of our MLP is
the same, but every connection has its own weight, specified by a 6-D array W.
The indices i, j, and k address the output channel, row, and column, as for Z,
while l, m, and n address the input channel and the row and column offsets within
the receptive field, as for K above:

    Z_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j+m,\, k+n} \, W_{i,j,k,l,m,n} \right].
Locally connected layers are useful when we know that each feature should
be a function of a small part of space, but there is no reason to think that the
same feature should occur across all of space. For example, if we want to tell if
an image is a picture of a face, we only need to look for the mouth in the bottom
half of the image.
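A sketch of such a layer (ours), with one independent weight for every connection:

    import numpy as np

    def locally_connected(W, V):
        # Unshared "convolution": every output location has its own weights.
        # W has shape (out_channels, out_rows, out_cols, in_channels, kernel_rows, kernel_cols).
        # V has shape (in_channels, rows, cols).
        oc, out_r, out_c, ic, kr, kc = W.shape
        Z = np.zeros((oc, out_r, out_c))
        for i in range(oc):
            for j in range(out_r):
                for k in range(out_c):
                    Z[i, j, k] = np.sum(V[:, j:j + kr, k:k + kc] * W[i, j, k])
        return Z

    V = np.random.rand(3, 8, 8)
    W = np.random.rand(4, 6, 6, 3, 3, 3)    # 6 = 8 - 3 + 1 output positions per axis
    print(locally_connected(W, V).shape)    # (4, 6, 6)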
It can also be useful to make versions of convolution or locally connected layers
in which the connectivity is further restricted, for example so that each
output channel i is a function of only a subset of the input channels l.
Tiled convolution offers a compromise between a convolutional layer and a
locally connected layer. We make K a 6-D tensor, where two of the dimensions
correspond to different locations in the output map. Rather than having a sep-
arate index for each location in the output map, output locations cycle through
a set of t different choices of kernel stack in each direction. If t is equal to the
output width, this is the same as a locally connected layer.
    Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n} \, K_{i,l,m,n,\, j\%t,\, k\%t},

where % is the modulo operation.
Note that it is straightforward to generalize this equation to use a different
tiling range for each dimension.
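A sketch of tiled convolution (ours) that cycles through t kernel stacks along each spatial axis:

    import numpy as np

    def tiled_conv(K, V, t):
        # Tiled "convolution": output locations cycle through t kernel choices per axis.
        # K has shape (out_channels, in_channels, kernel_rows, kernel_cols, t, t).
        # V has shape (in_channels, rows, cols).
        oc, ic, kr, kc, _, _ = K.shape
        _, r, c = V.shape
        Z = np.zeros((oc, r - kr + 1, c - kc + 1))
        for i in range(oc):
            for j in range(Z.shape[1]):
                for k in range(Z.shape[2]):
                    kernel = K[i, :, :, :, j % t, k % t]   # which of the t x t kernel stacks to use
                    Z[i, j, k] = np.sum(V[:, j:j + kr, k:k + kc] * kernel)
        return Z

    K = np.random.rand(2, 3, 3, 3, 2, 2)     # t = 2 along each direction
    V = np.random.rand(3, 8, 8)
    print(tiled_conv(K, V, t=2).shape)       # (2, 6, 6)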
Both locally connected layers and tiled convolutional layers have an interesting
interaction with max-pooling: the detector units of these layers are driven by
different filters. If these filters learn to detect different transformed versions of
the same underlying features, then the max-pooled units become invariant to the
learned transformation (see Fig. 9.8). Convolutional layers are hard-coded to be
invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a
convolutional network. To perform learning, one must be able to compute the
gradient with respect to the kernel, given the gradient with respect to the outputs.
In some simple cases, this operation can be performed using the convolution
operation, but many cases of interest, including the case of stride greater than 1,
do not have this property.
Convolution is a linear operation and can thus be described as a matrix mul-
tiplication (if we first reshape the input array into a flat vector). The matrix
involved is a function of the convolution kernel. The matrix is sparse and each
element of the kernel is copied to several elements of the matrix. It is not usually
practical to implement convolution in this way, but the matrix view is nonetheless
conceptually useful.
Multiplication by the transpose of the matrix defined by convolution is also a
useful operation. This is the operation needed to backpropagate error derivatives
through a convolutional layer, so it is needed to train convolutional networks
that have more than one hidden layer. This same operation is also needed to
compute the reconstruction in a convolutional autoencoder (or to perform the
analogous role in a convolutional RBM, sparse coding model, etc.). Like the
kernel gradient operation, this input gradient operation can be implemented using
a convolution in some cases, but in the general case requires a third operation to be
implemented. Care must be taken to coordinate this transpose operation with the
forward propagation. The size of the output that the transpose operation should
return depends on the zero padding policy and stride of the forward propagation
operation, as well as the size of the forward propagation’s output map. In some
cases, multiple sizes of input to forward propagation can result in the same size
of output map, so the transpose operation must be explicitly told what the size
of the original input was.
It turns out that these three operations–convolution, backprop from output
to weights, and backprop from output to inputs–are sufficient to compute all of
the gradients needed to train any depth of feedforward convolutional network,
as well as to train convolutional networks with reconstruction functions based
on the transpose of convolution. See (Goodfellow, 2010) for a full derivation
of the equations in the fully general multi-dimensional, multi-example case. To
give a sense of how these equations work, we present the two dimensional, single
example version here.
Suppose we want to train a convolutional network that incorporates strided
convolution of kernel stack K applied to multi-channel image V with stride s, as
defined by c(K, V, s) in equation 9.1. Suppose we want to minimize some loss
function J(V, K). During forward propagation, we will need to use c itself to
output Z, which is then propagated through the rest of the network and used
to compute J. During backpropagation, we will receive an array G such that

    G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K).
To train the network, we need to compute the derivatives with respect to the
weights in the kernel. To do so, we can use a function
    g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K)
                         = \sum_{m,n} G_{i,m,n} \, V_{j,\, m \times s + k,\, n \times s + l}.
If this layer is not the bottom layer of the network, we’ll need to compute the
gradient with respect to V in order to backpropagate the error farther down. To
do so, we can use a function
    h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)
                       = \sum_{\substack{l,m \\ \text{s.t.}\ s \times l + m = j}} \;
                         \sum_{\substack{n,p \\ \text{s.t.}\ s \times n + p = k}} \;
                         \sum_q K_{q,i,m,p} \, G_{q,l,n}.
We could also use h to define the reconstruction of a convolutional autoen-
coder, or the probability distribution over visible given hidden units in a convo-
lutional RBM or sparse coding model. Suppose we have hidden units H in the
same format as Z and we define a reconstruction
R = h(K, H, s).
In order to train the autoencoder, we will receive the gradient with respect
to R as an array E. To train the decoder, we need to obtain the gradient with
respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain
the gradient with respect to H. This is given by c(K, E, s). It is also possible to
differentiate through g using c and h, but these operations are not needed for the
backpropagation algorithm on any standard network architectures.
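Both gradient operations can be sketched directly from these formulas; the loop-based code below is our own naive illustration, not an efficient implementation:

    import numpy as np

    def kernel_grad(G, V, s, kernel_shape):
        # g(G, V, s): gradient of the loss with respect to the kernel K.
        oc, ic, kr, kc = kernel_shape
        dK = np.zeros(kernel_shape)
        _, out_r, out_c = G.shape
        for i in range(oc):
            for j in range(ic):
                for k in range(kr):
                    for l in range(kc):
                        for m in range(out_r):
                            for n in range(out_c):
                                dK[i, j, k, l] += G[i, m, n] * V[j, m * s + k, n * s + l]
        return dK

    def input_grad(K, G, s, input_shape):
        # h(K, G, s): gradient of the loss with respect to the input V (also the
        # transpose operation used for reconstruction in a convolutional autoencoder).
        oc, ic, kr, kc = K.shape
        dV = np.zeros(input_shape)
        _, out_r, out_c = G.shape
        for q in range(oc):
            for l in range(out_r):
                for n in range(out_c):
                    for m in range(kr):
                        for p in range(kc):
                            dV[:, s * l + m, s * n + p] += K[q, :, m, p] * G[q, l, n]
        return dV

    V = np.random.rand(3, 8, 8)
    K = np.random.rand(2, 3, 3, 3)
    G = np.random.rand(2, 3, 3)               # stand-in for the gradient from the layers above, s = 2
    print(kernel_grad(G, V, 2, K.shape).shape)   # (2, 3, 3, 3)
    print(input_grad(K, G, 2, V.shape).shape)    # (3, 8, 8)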
Generally, we do not use only a linear operation in order to transform from the
inputs to the outputs in a convolutional layer. We generally also add some bias
term to each output before applying the nonlinearity. This raises the question
of how to share parameters among the biases. For locally connected layers it is
natural to give each unit its own bias, and for tiled convolution, it is natural to
share the biases with the same tiling pattern as the kernels. For convolutional
layers, it is typical to have one bias per channel of the output and share it across
all locations within each convolution map. However, if the input is of known,
fixed size, it is also possible to learn a separate bias at each location of the output
map. Separating the biases may slightly reduce the statistical efficiency of the
model, but also allows the model to correct for differences in the image statistics
at different locations. For example, when using implicit zero padding, detector
units at the edge of the image receive less total input and may need larger biases.
9.6 Data Types
The data used with a convolutional network usually consists of several channels,
each channel being the observation of a different quantity at some point in space
or time. See Table 9.1 for examples of data types with different dimensionalities
and number of channels.
So far we have discussed only the case where every example in the train
and test data has the same spatial dimensions. One advantage to convolutional
networks is that they can also process inputs with varying spatial extents. These
kinds of input simply cannot be represented by traditional, matrix multiplication-
based neural networks. This provides a compelling reason to use convolutional
networks even when computational cost and overfitting are not significant issues.
For example, consider a collection of images, where each image has a different
width and height. It is unclear how to apply matrix multiplication. Convolution
is straightforward to apply; the kernel is simply applied a different number of
times depending on the size of the input, and the output of the convolution
operation scales accordingly. Sometimes the output of the network is allowed to
have variable size as well as the input, for example if we want to assign a class
label to each pixel of the input. In this case, no further design work is necessary.
In other cases, the network must produce some fixed-size output, for example if
we want to assign a single class label to the entire image. In this case we must
make some additional design steps, like inserting a pooling layer whose pooling
regions scale in size proportional to the size of the input, in order to maintain a
fixed number of pooled outputs.
Note that the use of convolution for processing variable sized inputs only makes
sense for inputs that have variable size because they contain varying amounts of
observation of the same kind of thing–different lengths of recordings over time,
different widths of observations over space, etc. Convolution does not make sense
if the input has variable size because it can optionally include different kinds
of observations. For example, if we are processing college applications, and our
features consist of both grades and standardized test scores, but not every ap-
plicant took the standardized test, then it does not make sense to convolve the
same weights over both the features corresponding to the grades and the features
corresponding to the test scores.
9.7 Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing
more than one million units. Powerful implementations exploiting parallel compu-
tation resources, as discussed in Chapter 11, are essential. However, in many cases
it is also possible to speed up convolution by selecting an appropriate convolution
algorithm.
Convolution is equivalent to converting both the input and the kernel to the
frequency domain using a Fourier transform, performing point-wise multiplication
Table 9.1: Examples of different formats of data that can be used with convolutional
networks.

1-D, single channel. Audio waveform: The axis we convolve over corresponds to
time. We discretize time and measure the amplitude of the waveform once per
time step.

1-D, multi-channel. Skeleton animation data: Animations of 3-D computer-rendered
characters are generated by altering the pose of a “skeleton” over time. At each
point in time, the pose of the character is described by a specification of the angles
of each of the joints in the character’s skeleton. Each channel in the data we feed
to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier
transform: We can transform the audio waveform into a 2-D array with different
rows corresponding to different frequencies and different columns corresponding
to different points in time. Using convolution over the time axis makes the model
equivariant to shifts in time. Using convolution across the frequency axis makes
the model equivariant to frequency, so that the same melody played in a different
octave produces the same representation, but at a different height in the network’s
output.

2-D, multi-channel. Color image data: One channel contains the red pixels, one
the green pixels, and one the blue pixels. The convolution kernel moves over both
the horizontal and vertical axes of the image, conferring translation equivariance
in both directions.

3-D, single channel. Volumetric data: A common source of this kind of data is
medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: One axis corresponds to time, one to the
height of the video frame, and one to the width of the video frame.
of the two signals, and converting back to the time domain using an inverse
Fourier transform. For some problem sizes, this can be faster than the naive
implementation of discrete convolution.
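A sketch of this frequency-domain route for the 1-D case (ours; practical implementations handle padding, batching, and multiple channels more carefully):

    import numpy as np

    def fft_conv1d(x, w):
        # Convolve x with w by pointwise multiplication in the frequency domain.
        n = len(x) + len(w) - 1                  # length of the full linear convolution
        X = np.fft.rfft(x, n)                    # zero-pad to length n to avoid circular wrap-around
        W = np.fft.rfft(w, n)
        return np.fft.irfft(X * W, n)

    x = np.random.rand(64)
    w = np.random.rand(5)
    print(np.allclose(fft_conv1d(x, w), np.convolve(x, w)))   # True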
When a d-dimensional kernel can be expressed as the outer product of d
vectors, one vector per dimension, the kernel is called separable. When the kernel
is separable, naive convolution is inefficient. It is equivalent to composing d one-
dimensional convolutions with each of these vectors. The composed approach
is significantly faster than performing one d-dimensional convolution with their
outer product. The kernel also takes fewer parameters to represent as vectors.
If the kernel is w elements wide in each dimension, then naive multidimensional
convolution requires O(w^d) runtime and parameter storage space, while separable
convolution requires O(w × d) runtime and parameter storage space. Of course,
not every convolution can be represented in this way.
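Separability can be checked numerically for the 2-D case; the sketch below (ours) compares one full 2-D convolution against the composition of two 1-D convolutions:

    import numpy as np

    # A separable 2-D kernel: the outer product of two 1-D vectors.
    row = np.array([1.0, 2.0, 1.0])
    col = np.array([1.0, 0.0, -1.0])
    K = np.outer(col, row)                       # 3 x 3 kernel stored with only 2w = 6 underlying numbers

    I = np.random.rand(32, 32)

    # Route 1: one full 2-D convolution (O(w^2) work per output pixel).
    full = np.array([[np.sum(I[i:i + 3, j:j + 3] * K[::-1, ::-1])
                      for j in range(I.shape[1] - 2)]
                     for i in range(I.shape[0] - 2)])

    # Route 2: compose two 1-D convolutions, one per dimension (O(2w) work per output pixel).
    tmp = np.apply_along_axis(lambda r: np.convolve(r, row, mode="valid"), 1, I)
    sep = np.apply_along_axis(lambda c: np.convolve(c, col, mode="valid"), 0, tmp)
    print(np.allclose(full, sep))                # True: same result with far fewer operations for large w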
Devising faster ways of performing convolution or approximate convolution
without harming the accuracy of the model is an active area of research. Even
techniques that improve the efficiency of only forward propagation are useful
because in the commercial setting, it is typical to devote many more resources to
deployment of a network than to its training.
9.8 The Neuroscientific Basis for Convolutional Net-
works
Convolutional networks are perhaps the greatest success story of biologically in-
spired artificial intelligence. Though convolutional networks have been guided
by many other fields, some of the key design principles of neural networks were
drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments
long before the relevant computational models were developed. Neurophysiol-
ogists David Hubel and Torsten Wiesel collaborated for several years to deter-
mine many of the most basic facts about how the mammalian vision system
works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were even-
tually recognized with a Nobel Prize. Their findings that have had the greatest
influence on contemporary deep learning models were based on recording the
activity of individual neurons in cats. By anesthetizing the cat, they could im-
mobilize the cat’s eye and observe how neurons in the cat’s brain responded to
images projected in precise locations on a screen in front of the cat.
Their work helped to characterize many aspects of brain function that are
beyond the scope of this book. From the point of view of deep learning, we can
focus on a simplified, cartoon view of brain function.
In this simplified view, we focus on a part of the brain called V1, also known
as the primary visual cortex. V1 is the first area of the brain that begins to
perform significantly advanced processing of visual input. In this cartoon view,
images are formed by light arriving in the eye and stimulating the retina, the
light-sensitive tissue in the back of the eye. The neurons in the retina perform
some simple preprocessing of the image but do not substantially alter the way it
is represented. The image then passes through the optic nerve and a brain region
called the lateral geniculate nucleus. As far as we are concerned here, the main
role of both of these anatomical regions is simply to carry the signal from
the eye to V1, which is located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional struc-
ture mirroring the structure of the image in the retina. For example, light
arriving at the lower half of the retina affects only the corresponding half of
V1. Convolutional networks capture this property by having their features
defined in terms of two dimensional maps.
2. V1 contains many simple cells. A simple cell’s activity can to some extent be
characterized by a linear function of the image in a small, spatially localized
receptive field. The detector units of a convolutional network are designed
to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that
are similar to those detected by simple cells, but complex cells are invariant
to small shifts in the position of the feature. This inspires the pooling units
of convolutional networks. Complex cells are also invariant to some changes
in lighting that cannot be captured simply by pooling over spatial locations.
These invariances have inspired some of the cross-channel pooling strategies
in convolutional networks, such as maxout units (Goodfellow et al., 2013a).
Though we know the most about V1, it is generally believed that the same
basic principles apply to other brain regions. In our cartoon view of the visual
system, the basic strategy of detection followed by pooling is repeatedly applied
as we move deeper into the brain. As we pass through multiple anatomical layers
of the brain, we eventually find cells that respond to some specific concept and are
invariant to many transformations of the input. These cells have been nicknamed
“grandmother cells”— the idea is that a person could have a neuron that activates
when seeing an image of their grandmother, regardless of whether she appears in
the left or right side of the image, whether the image is a close-up of her face or
a zoomed-out shot of her entire body, whether she is brightly lit or in shadow, etc.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals,
and found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when
a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text
containing the words “Halle Berry.” Of course, this has nothing to do with Halle
Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer
Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the
LGN, to V1, then onward to V2, then V4, then IT. This happens within the first
100ms of glimpsing an object. If a person is allowed to continue looking at the
object for more time, then information will begin to flow backwards as the brain
uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person’s gaze, and observe only the firing rates that
result from the first 100ms of mostly feed-forward activation, then IT proves to be
very similar to a convolutional network. Convolutional networks can predict IT
firing rates, and also perform very similarly to (time limited) humans on object
recognition tasks (DiCarlo, 2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists, but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called
the fovea. The fovea only observes an area about the size of a thumbnail
held at arm's length. Though we feel as if we can see an entire scene in high
resolution, this is an illusion created by the subconscious part of our brain,
as it stitches together several glimpses of small areas. Most convolutional
networks actually receive large full-resolution photographs as input.
• The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional networks
so far are purely visual.
• The human visual system does much more than just recognize objects. It is
able to understand entire scenes including many objects and relationships
between objects, and processes rich 3-D geometric information needed for
our bodies to interface with the world. Convolutional networks have been
applied to some of these problems but these applications are in their infancy.
• Even simple brain areas like V1 are heavily impacted by feedback from
higher levels. Feedback has been explored extensively in neural network
models but has not yet been shown to offer a compelling improvement.
• While feed-forward IT firing rates capture much of the same information as
convolutional network features, it’s not clear how similar the intermediate
computations are. The brain probably uses very different activation and
pooling functions. An individual neuron’s activation probably is not well-
characterized by a single linear filter response. A recent model of V1 involves
multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our
cartoon picture of “simple cells” and “complex cells” might create a non-
existent distinction; simple cells and complex cells might both be the same
kind of cell but with their “parameters” enabling a continuum of behaviors
ranging from what we call “simple” to what we call “complex.”
It’s also worth mentioning that neuroscience has told us relatively little about
how to train convolutional networks. Model structures inspired by the work of
Hubel and Wiesel date back to the Neocognitron (Fukushima, 1980) but the
Neocognitron relied on a relatively heuristic learning algorithm. Convolutional
networks did not begin to work well until they were combined with the backprop-
agation algorithm (LeCun et al., 1989), which was not inspired by any neurosci-
entific observation and is considered by some to be biologically implausible.
Some of the most striking correspondences between neuroscience and machine
learning come from visually comparing the features learned by machine learning
models with those employed by V1.
9.9 Convolutional Networks and the History of Deep
Learning
Convolutional networks have played an important role in the history of deep
learning. They are a key example of a successful application of insights obtained
by studying the brain to machine learning. They were also some of
the first deep models to perform well, long before arbitrary deep models were
considered viable. Convolutional networks were also some of the first neural
networks to solve important commercial applications and remain at the forefront
of commercial applications of deep learning today.
It is not entirely clear why convolutional networks succeeded when general
backpropagation networks were considered to have failed. It may simply be that
convolutional networks were more computationally efficient than fully connected
networks, so it was easier to run multiple experiments with them and tune their
implementation and hyperparameters. Larger networks also seem to be easier
to train. With modern hardware, fully connected networks appear to perform
reasonably on many tasks, even when using datasets that were available and
activation functions that were popular during the times when fully connected
networks were believed not to work well. It may be that the primary barriers
to the success of neural networks were psychological. Whatever the case, it is
fortunate that convolutional networks performed well and paved the way to the
acceptance of neural networks in general.