Chapter 11
Convolutional Networks
Convolutional networks (also known as convolutional neural networks or CNNs) are a
specialized kind of neural network for processing data that has a known, grid-like topol-
ogy. Examples include time-series data, which can be thought of as a 1D grid taking
samples at regular time intervals, and image data, which can be thought of as a 2D grid of
pixels. Convolutional networks have been tremendously successful in practical applica-
tions. See section 20.1.2 for details. The name ā€œconvolutional neural networkā€ indicates
that the network employs a mathematical operation called convolution. Convolution is
a specialized kind of linear operation. Convolutional networks are simply neural
networks that use convolution in place of general matrix multiplication in at least one of
their layers.
In this chapter, we will first describe what convolution is. Next, we will explain
the motivation behind using convolution in a neural network. We will then describe an
operation called pooling, which almost all convolutional networks employ. Usually, the
operation used in a convolutional neural network does not correspond precisely to the
definition of convolution as used in other fields such as engineering or pure mathematics.
We will describe several variants on the convolution function that are widely used in
practice for neural networks. We will also show how convolution may be applied to many
kinds of data, with different numbers of dimensions. We then discuss means of making
convolution more efficient. We conclude with comments about the role convolutional
networks have played in the history of deep learning.
11.1 The convolution operation
In its most general form, convolution is an operation on two functions of a real-valued
argument. To motivate the definition of convolution, let’s start with examples of two
functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser
sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are
real-valued, i.e., we can get a different reading from the laser sensor at any instant in
time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate
of the spaceship’s position, we would like to average together several measurements.
Of course, more recent measurements are more relevant, so we will want this to be a
weighted average that gives more weight to recent measurements. We can do this with
a weighting function w(a), where a is the age of a measurement. If we apply such a
weighted average operation at every moment, we obtain a new function s providing a
smoothed estimate of the position of the spaceship:
\[ s(t) = \int_{-\infty}^{\infty} x(a) \, w(t - a) \, da . \]
This operation is called convolution. The convolution operation is typically denoted
with an asterisk:
s(t) = (x āˆ— w)(t)
In our example, w needs to be a valid probability density function, or the output
is not a weighted average. Also, w needs to be 0 for all negative arguments, or it will
look into the future, which is presumably beyond our capabilities. These limitations are
particular to our example though. In general, convolution is defined for any functions
for which the above integral is defined, and may be used for other purposes besides
taking weighted averages.
In convolutional network terminology, the first argument (in this example, the func-
tion x) to the convolution is often referred to as the input and the second argument (in
this example, the function w) as the kernel. The output is sometimes referred to as the
feature map.
In our example, the idea of a laser sensor that can provide measurements at every
instant in time is not realistic. Usually, when we work with data on a computer, time
will be discretized, and our sensor will provide data at regular intervals. In our example,
it might be more realistic to assume that our laser provides one measurement per
second. t can then take on only integer values. If we now assume that x and w are
defined only on integer t, we can define the discrete convolution:
\[ s[t] = (x \ast w)(t) = \sum_{a=-\infty}^{\infty} x[a] \, w[t - a] . \]
In machine learning applications, the input is usually an array of data and the
kernel is usually an array of learnable parameters. Because each element of the input
and kernel must be explicitly stored separately, we usually assume that these functions
are zero everywhere but the finite set of points for which we store the values. This means
that in practice we can implement the infinite summation as a summation over a finite
number of array elements.
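As a concrete illustration, here is a minimal NumPy sketch of this finite summation (the measurement values and the three-tap kernel are made up for the example); it also checks the result against np.convolve, which computes the same operation:

```python
import numpy as np

# Hypothetical noisy position readings x[t] and a weighting kernel w[a]
# that sums to 1 and gives more weight to recent measurements.
x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2])
w = np.array([0.5, 0.3, 0.2])

# Direct implementation of s[t] = sum_a x[a] w[t - a], keeping only the
# positions t at which every array index is valid.
s = np.array([sum(x[a] * w[t - a] for a in range(t - len(w) + 1, t + 1))
              for t in range(len(w) - 1, len(x))])

# np.convolve implements the same (kernel-flipping) discrete convolution.
assert np.allclose(s, np.convolve(x, w, mode="valid"))
```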
Finally, we often use convolutions over more than one axis at a time. For example,
if we use a two-dimensional image I as our input, we probably also want to use a
two-dimensional kernel K:
\[ s[i, j] = (I \ast K)[i, j] = \sum_{m} \sum_{n} I[m, n] \, K[i - m, j - n] . \]
Note that convolution is commutative, meaning we can equivalently write:
\[ s[i, j] = (I \ast K)[i, j] = \sum_{m} \sum_{n} I[i - m, j - n] \, K[m, n] . \]
Usually the latter view is more straightforward to implement in a machine learning
library, because there is less variation in the range of valid values of m and n.
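The equivalence of the two forms can be checked directly in code. The sketch below is a naive loop implementation (written to mirror the formulas rather than to be fast, with a small made-up image and kernel); it computes the full convolution both ways and confirms that they agree:

```python
import numpy as np

def conv2d_full_sum_over_input(I, K):
    # s[i, j] = sum_{m,n} I[m, n] K[i - m, j - n]  (sum indexed by the input).
    ih, iw = I.shape
    kh, kw = K.shape
    S = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(ih):
                for n in range(iw):
                    if 0 <= i - m < kh and 0 <= j - n < kw:
                        S[i, j] += I[m, n] * K[i - m, j - n]
    return S

def conv2d_full_sum_over_kernel(I, K):
    # s[i, j] = sum_{m,n} I[i - m, j - n] K[m, n]  (sum indexed by the kernel).
    ih, iw = I.shape
    kh, kw = K.shape
    S = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(kh):
                for n in range(kw):
                    if 0 <= i - m < ih and 0 <= j - n < iw:
                        S[i, j] += I[i - m, j - n] * K[m, n]
    return S

I = np.arange(25.0).reshape(5, 5)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
assert np.allclose(conv2d_full_sum_over_input(I, K),
                   conv2d_full_sum_over_kernel(I, K))
```

Note how the second form only needs bounds checks against the input, which is why it is the easier one to implement.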
Discrete convolution can be viewed as multiplication by a matrix. However, the
matrix has several entries constrained to be equal to other entries. For example, for
univariate discrete convolution, each row of the matrix is constrained to be equal to
the row above shifted by one element. This is known as a Toeplitz matrix. In two
dimensions, a doubly block circulant matrix corresponds to convolution. In addition
to these constraints that several elements be equal to each other, convolution usually
corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero).
This is because the kernel is usually much smaller than the input image. Viewing
convolution as matrix multiplication usually does not help to implement convolution
operations, but it is useful for understanding and designing neural networks. Any neural
network algorithm that works with matrix multiplication and does not depend on specific
properties of the matrix structure should work with convolution, without requiring any
further changes to the neural network. Typical convolutional neural networks do make
use of further specializations in order to deal with large inputs efficiently, but these are
not strictly necessary from a theoretical perspective.
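For instance, in one dimension this matrix can be built explicitly. The sketch below (illustrative only; the kernel values are arbitrary) constructs the Toeplitz matrix corresponding to a small kernel and checks that multiplying by it reproduces np.convolve:

```python
import numpy as np

def convolution_matrix(w, n_in):
    """Toeplitz matrix C such that C @ x == np.convolve(x, w, mode="valid")."""
    k = len(w)
    n_out = n_in - k + 1
    C = np.zeros((n_out, n_in))
    for t in range(n_out):
        C[t, t:t + k] = w[::-1]  # each row equals the row above, shifted by one
    return C

x = np.random.randn(8)
w = np.array([0.5, 0.3, 0.2])
C = convolution_matrix(w, len(x))
assert np.allclose(C @ x, np.convolve(x, w, mode="valid"))
# C is mostly zero and has only len(w) distinct nonzero values, however long x is.
```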
11.2 Motivation
Convolution leverages three important ideas that can help improve a machine learning
system: sparse interactions, parameter sharing, and equivariant representations. More-
over, convolution provides a means for working with inputs of variable size. We now
describe each of these ideas in turn.
Traditional neural network layers use a matrix multiplication to describe the inter-
action between each input unit and each output unit. This means every output unit
interacts with every input unit. Convolutional networks, however, typically have sparse
interactions (also referred to as sparse connectivity or sparse weights). This is accom-
plished by making the kernel smaller than the input. For example, when processing an
image, the input image might have thousands or millions of pixels, but we can detect
small, meaningful features such as edges with kernels that occupy only tens or hundreds
of pixels. This means that we need to store fewer parameters, which both reduces the
memory requirements of the model and improves its statistical efficiency. It also means
that computing the output requires fewer operations. These improvements in efficiency
are usually quite large. If there are m inputs and n outputs, then matrix multiplication
requires m Ɨ n parameters, and the algorithms used in practice have O(m Ɨ n) runtime
(per example). If we limit the number of connections each output may have to k, then
the sparsely connected approach requires only k Ɨ n parameters and O(k Ɨ n) runtime.
Figure 11.1: Sparse connectivity, viewed from below: We highlight one input unit, x_3, and also highlight the output units in s that are affected by this unit. (Left) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x_3. (Right) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x_3.
For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of sparse connectivity, see Fig. 11.1 and Fig. 11.2.
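To make the arithmetic concrete, here is a toy comparison (the layer sizes are made up):

```python
# Hypothetical layer with m inputs and n outputs.
m, n, k = 10_000, 10_000, 9   # k = number of connections per output unit

dense_parameters = m * n      # 100,000,000 for full matrix multiplication
sparse_parameters = k * n     # 90,000 for a kernel of width k
print(dense_parameters, sparse_parameters)
```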
Parameter sharing refers to using the same parameter for more than one function
in a model. In a traditional neural net, each element of the weight matrix is used
exactly once when computing the output of a layer. It is multiplied by one element of
the input, and then never revisited. As a synonym for parameter sharing, one can say
that a network has tied weights, because the value of the weight applied to one input
is tied to the value of a weight applied elsewhere. In a convolutional neural net, each
member of the kernel is used at every position of the input (except perhaps some of
the boundary pixels, depending on the design decisions regarding the boundary). The
parameter sharing used by the convolution operation means that rather than learning
a separate set of parameters for every location, we learn only one set. This does not
affect the runtime of forward propagation (it is still O(k Ɨ n)), but it does further reduce
the storage requirements of the model to k parameters. Recall that k is usually several
orders of magnitude smaller than m. Since m and n are usually roughly the same size, k
is practically insignificant compared to m Ɨ n. Convolution is thus dramatically more
efficient than dense matrix multiplication in terms of the memory requirements and
statistical efficiency. For a graphical depiction of how parameter sharing works, see
Fig. 11.3.
As an example of both of these first two principles in action, Fig. 11.4 shows how
sparse connectivity and parameter sharing can dramatically improve the efficiency of a
linear function for detecting edges in an image.
In the case of convolution, the particular form of parameter sharing causes the layer
to have a property called equivariance to translation. To say a function is equivariant
means that if the input changes, the output changes in the same way. Specifically,
a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
Figure 11.2: Sparse connectivity, viewed from above: We highlight one output unit, s_3, and also highlight the input units in x that affect this unit. These units are known as the receptive field of s_3. (Left) When s is formed by convolution with a kernel of width 3, only three inputs affect s_3. (Right) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s_3.
Figure 11.3: Parameter sharing: We highlight the connections that use a particular
parameter in two different models. (Left) We highlight uses of the central element
of a 3-element kernel in a convolutional model. Due to parameter sharing, this single
parameter is used at all input locations. (Right) We highlight the use of the central
element of the weight matrix in a fully connected model. This model has no parameter
sharing so the parameter is used only once.
In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then
the convolution function is equivariant to g. For example, define g(x) such that for
all i, g(x)[i] = x[i āˆ’ 1]. This shifts every element of x one unit to the right. If we
apply this transformation to x, then apply convolution, the result will be the same as
if we applied convolution to x, then applied the transformation to the output. When
processing time series data, this means that convolution produces a sort of timeline that
shows when different features appear in the input. If we move an event later in time
in the input, the exact same representation of it will appear in the output, just later in
time. Similarly with images, convolution creates a 2-D map of where certain features
appear in the input. If we move the object in the input, its representation will move the
same amount in the output. This is useful when we know that the same local function
is useful everywhere in the input. For example, when processing images, it is useful to
detect edges in the first layer of a convolutional network, and an edge looks the same
regardless of where it appears in the image. This property is not always useful. For
example, if we want to recognize a face, some portion of the network needs to vary with
spatial location, because the top of a face does not look the same as the bottom of a
face–the part of the network processing the top of the face needs to look for eyebrows,
while the part of the network processing the bottom of the face needs to look for a chin.
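This shift-equivariance can be verified numerically. The sketch below (with arbitrary random data) shifts a 1-D input one step to the right, convolves it, and compares against shifting the convolved output; away from the boundary the two results are identical:

```python
import numpy as np

x = np.random.randn(10)
w = np.random.randn(3)

def g(v):
    # g(v)[i] = v[i - 1]: shift one step to the right; a zero enters on the left.
    return np.concatenate(([0.0], v[:-1]))

shift_then_convolve = np.convolve(g(x), w, mode="full")
convolve_then_shift = g(np.convolve(x, w, mode="full"))

# The two agree everywhere except at the edges, where the shift pushes
# data off the end of the finite array.
print(np.allclose(shift_then_convolve[1:len(x)], convolve_then_shift[1:len(x)]))
```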
Note that convolution is not equivariant to some other transformations, such as
changes in the scale or rotation of an image. Other mechanisms are necessary for
handling these kinds of transformations.
We can think of the use of convolution as introducing an infinitely strong prior
probability distribution over the parameters of a layer. This prior says that the function
the layer should learn contains only local interactions and is equivariant to translation.
This view of convolution as an infinitely strong prior makes it clear that the efficiency
improvements of convolution come with a caveat: convolution is only applicable when
the assumptions made by this prior are close to true. The use of convolution constrains
the class of functions that the layer can represent. If the function that a layer needs to
learn is indeed a local, translation invariant function, then the layer will be dramatically
more efficient if it uses convolution rather than matrix multiplication. If the necessary
function does not have these properties, then using a convolutional layer will cause the
model to have high training error.
Finally, some kinds of data cannot be processed by neural networks defined by matrix
multiplication with a fixed-shape matrix. Convolution enables processing of some of
these kinds of data. We discuss this further in section 11.5.
11.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 11.5). In
the first stage, the layer performs several convolutions in parallel to produce a set of
presynaptic activations. In the second stage, each presynaptic activation is run through a
nonlinear activation function, such as the rectified linear activation function. This stage
is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.
Figure 11.4: Efficiency of edge detection. The image on the right was formed by taking
each pixel in the original image and subtracting the value of its neighboring pixel on
the left. This shows the strength of all of the vertically oriented edges in the input
image, which can be a useful operation for object detection. Both images are 280 pixels
tall. The input image is 320 pixels wide while the output image is 319 pixels wide.
This transformation can be described by a convolution kernel containing 2 elements,
and requires 319 Ɨ 280 Ɨ 3 = 267,960 floating point operations (two multiplications
and one addition per output pixel) to compute. To describe the same transformation
with a matrix multiplication would take 320 Ɨ280 Ɨ319 Ɨ280, or over 8 billion, entries
in the matrix, making convolution 4 billion times more efficient for representing this
transformation. The straightforward matrix multiplication algorithm performs over 16
billion floating point operations, making convolution roughly 60,000 times more efficient
computationally. Of course, most of the entries of the matrix would be zero. If we stored
only the nonzero entries of the matrix, then both matrix multiplication and convolution
would require the same number of floating point operations to compute. The matrix
would still need to contain 2 Ɨ 319 Ɨ 280 = 178,640 entries. Convolution is an extremely
efficient way of describing transformations that apply the same linear transformation of
a small, local region across the entire input.
[Figure 11.5 diagram. Complex layer terminology: input to layer, convolution stage (affine transform), detector stage (nonlinearity, e.g., rectified linear), pooling stage, next layer. Simple layer terminology: input to layers, convolution layer, detector layer, pooling layer, next layer.]
Figure 11.5: The components of a typical convolutional neural network layer. There
are two commonly used sets of terminology for describing these layers. Left) In this
terminology, the convolutional net is viewed as a small number of relatively complex
layers, with each layer having many ā€œstages.ā€ In this terminology, there is a one-to-one
mapping between kernel tensors and network layers. In this book we generally use this
terminology. Right) In this terminology, the convolutional net is viewed as a larger
number of simple layers; every step of processing is regarded as a layer in its own right.
This means that not every ā€œlayerā€ has parameters.
Figure 11.6: Max pooling introduces invariance. Left: A view of the middle of the output
of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top
row shows the outputs of max pooling, with a stride of 1 between pooling regions and a
pooling region width of 3. Right: A view of the same network, after the input has been
shifted to the right by 1 pixel. Every value in the bottom row has changed, but only
half of the values in the top row have changed, because the max pooling units are only
sensitive to the maximum value in the neighborhood, not its exact location.
A pooling function replaces the output of the net at a certain location with a sum-
mary statistic of the nearby outputs. For example, the max pooling operation reports
the maximum output within a rectangular neighborhood. Other popular pooling func-
tions include the average of a rectangular neighborhood, the L2 norm of a rectangular
neighborhood, or a weighted average based on the distance from the central pixel.
In all cases, pooling helps to make the representation become invariant to small
translations of the input. This means that if we translate the input by a small amount,
the values of most of the pooled outputs do not change. See Fig. 11.6 for an example of
how this works. Invariance to local translation can be a very useful property
if we care more about whether some feature is present than exactly where
it is. For example, when determining whether an image contains a face, we need not
know the location of the eyes with pixel-perfect accuracy, we just need to know that
there is an eye on the left side of the face and an eye on the right side of the face. In
other contexts, it is more important to preserve the location of a feature. For example,
if we want to find a corner defined by two edges meeting at a specific orientation, we
need to preserve the location of the edges well enough to test whether they meet.
The use of pooling can be viewed as adding an infinitely strong prior that the function
the layer learns must be invariant to small translations. When this assumption is correct,
it can greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool over
the outputs of separately parameterized convolutions, the features can learn which trans-
formations to become invariant to (see Fig. 11.7).
Because pooling summarizes the responses over a whole neighborhood, it is possible
to use fewer pooling units than detector units, by reporting summary statistics for
pooling regions spaced k pixels apart rather than 1 pixel apart. See Fig. 11.8 for an
example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process.
Figure 11.7: Example of learned invariances: If each of these filters drives units that
appear in the same max-pooling region, then the pooling unit will detect ā€œ5ā€s in any
rotation. By learning to have each filter be a different rotation of the ā€œ5ā€ template, this
pooling unit has learned to be invariant to rotation. This is in contrast to translation
invariance, which is usually achieved by hard-coding the net to pool over shifted versions
of a single learned filter.
Figure 11.8: Pooling with downsampling. Here we use max-pooling with a pool width
of 3 and a stride between pools of 2. This reduces the representation size by a factor of
2, which reduces the computational and statistical burden on the next layer. Note that
the final pool has a smaller size, but must be included if we do not want to ignore some
of the detector units.
When the number of parameters
in the next layer is a function of its input size (such as when the next layer is fully
connected and based on matrix multiplication) this reduction in the input size can also
result in improved statistical efficiency and reduced memory requirements for storing
the parameters.
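A minimal NumPy sketch of max pooling with downsampling (pool width 3 and stride 2, loosely following Fig. 11.8; the detector values are illustrative):

```python
import numpy as np

def max_pool_1d(detector, width=3, stride=2):
    # One summary statistic per pool; pools start every `stride` positions.
    # The final pool may be smaller so that no detector unit is ignored.
    return np.array([detector[start:start + width].max()
                     for start in range(0, len(detector), stride)])

detector_outputs = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
print(max_pool_1d(detector_outputs))   # [1.0, 0.2, 0.1]: half as many units
```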
For many tasks, pooling is essential for handling inputs of varying size. For exam-
ple, if we want to classify images of variable size, the input to the classification layer
must have a fixed size. This is usually accomplished by varying the size of and offset
between pooling regions so that the classification layer always receives the same number
of summary statistics regardless of the input size. For example, the final pooling layer
of the network may be defined to output four sets of summary statistics, one for each
quadrant of an image, regardless of the image size.
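One simple way to implement the quadrant idea is sketched below (the function name and shapes are illustrative): whatever the spatial size of the incoming feature map, the layer always emits four numbers per channel.

```python
import numpy as np

def quadrant_max_pool(feature_map):
    # feature_map has shape (channels, height, width); height and width may vary.
    C, H, W = feature_map.shape
    h, w = H // 2, W // 2
    quadrants = [feature_map[:, :h, :w], feature_map[:, :h, w:],
                 feature_map[:, h:, :w], feature_map[:, h:, w:]]
    # One max per channel per quadrant: the output shape (C, 4) is fixed.
    return np.stack([q.reshape(C, -1).max(axis=1) for q in quadrants], axis=1)

print(quadrant_max_pool(np.random.randn(6, 11, 17)).shape)  # (6, 4)
print(quadrant_max_pool(np.random.randn(6, 32, 32)).shape)  # (6, 4)
```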
11.4 Variants of the basic convolution function
When discussing convolution in the context of neural networks, we usually do not refer
exactly to the standard discrete convolution operation as it is usually understood in the
mathematical literature. The functions used in practice differ slightly. Here we describe
these differences in detail, and highlight some useful properties of the functions used in
neural networks.
First, when we refer to convolution in the context of neural networks, we usually
actually mean an operation that consists of many applications of convolution in parallel.
This is because convolution with a single kernel can only extract one kind of feature,
albeit at many spatial locations. Usually we want each layer of our network to extract
many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green, and
blue intensity at each pixel. In a multilayer convolutional network, the input to the
second layer is the output of the first layer, which usually has the output of many
different convolutions at each position. When working with images, we usually think of
the input and output of the convolution as being 3-D arrays, with one index into the
different channels and two indices into the spatial coordinates of each channel. (Software
implementations usually work in batch mode, so they will actually use 4-D arrays, with
the fourth axis indexing different examples in the batch.)
Note that because convolutional networks usually use multi-channel convolution,
the linear operations they are based on are not guaranteed to be commutative, even
if kernel-flipping is used. These multi-channel operations are only commutative if each
operation has the same number of output channels as input channels.
Convolution is traditionally defined to index the kernel in the opposite order as
the input. This definition makes convolution commutative, which can be convenient
for writing proofs. In the context of neural networks, the commutative property of
convolution is not especially helpful, so many implementations of convolution index
both the kernel and the input in the same order.
Assume we have a 4-D kernel array K with element k_{i,j,k,l} giving the connection
strength between a unit in channel i of the output and a unit in channel j of the input,
with an offset of k rows and l columns between the output unit and the input unit.
Assume our input consists of observed data V with element v_{i,j,k} giving the value of the
input unit within channel i at row j and column k. Assume our output consists of Z
with the same format as V. If Z is produced by convolving K across V without flipping
K, then
\[ z_{i,j,k} = \sum_{l,m,n} v_{l,\, j+m,\, k+n} \, k_{i,l,m,n} , \]
where the summation over l, m, and n is over all values for which the array indexing
operations inside the summation are valid.
We may also want to skip over some positions of the kernel in order to reduce the
computational cost (at the expense of not extracting our features as finely). If we want
to sample only every s pixels in each direction, then we can define a downsampled
convolution function c such that:
\[ z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\, j \times s + m,\, k \times s + n} \, k_{i,l,m,n} \right] . \tag{11.1} \]
We refer to s as the stride of this downsampled convolution. It is also possible to
define a separate stride for each direction of motion.
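A direct, loop-based NumPy sketch of the downsampled convolution c(K, V, s) of equation 11.1 is given below (written to mirror the formula rather than to be fast; the shapes are made up, and setting s = 1 recovers the unstrided multi-channel convolution above):

```python
import numpy as np

def c(K, V, s):
    # K: (out_channels, in_channels, kernel_rows, kernel_cols)
    # V: (in_channels, rows, cols);  s: stride
    out_ch, in_ch, kh, kw = K.shape
    out_rows = (V.shape[1] - kh) // s + 1
    out_cols = (V.shape[2] - kw) // s + 1
    Z = np.zeros((out_ch, out_rows, out_cols))
    for i in range(out_ch):
        for j in range(out_rows):
            for k_ in range(out_cols):
                for l in range(in_ch):
                    for m in range(kh):
                        for n in range(kw):
                            Z[i, j, k_] += V[l, j * s + m, k_ * s + n] * K[i, l, m, n]
    return Z

V = np.random.randn(3, 8, 8)      # e.g. an RGB image
K = np.random.randn(4, 3, 3, 3)   # 4 output channels, 3x3 kernels
print(c(K, V, 2).shape)           # (4, 3, 3)
```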
One essential feature of any convolutional network implementation is the ability to
implicitly zero-pad the input V in order to make it wider. Without this feature, the
width of the representation shrinks by the kernel width - 1 at each layer. Zero padding
the input allows us to control the kernel width and the size of the output independently.
Without zero padding, we are forced to choose between shrinking the spatial extent of
the network rapidly and using small kernels–both scenarios that significantly limit the
expressive power of the network. See Fig. 11.9 for an example.
Three special cases of the zero-padding setting are worth mentioning. One is the
extreme case in which no zero-padding is used whatsoever, and the convolution kernel
is only allowed to visit positions where the entire kernel is contained entirely within
the image. In MATLAB terminology, this is called valid convolution. In this case, all
pixels in the output are a function of the same number of pixels in the input, so the
behavior of an output pixel is somewhat more regular. However, the size of the output
shrinks at each layer. If the input image is of size m Ɨ m and the kernel is of size
k Ɨ k, the output will be of size (m āˆ’ k + 1) Ɨ (m āˆ’ k + 1). The rate of this shrinkage
can be dramatic if the kernels used are large. Since the shrinkage is greater than 0,
it limits the number of convolutional layers that can be included in the network. As
layers are added, the spatial dimension of the network will eventually drop to 1 Ɨ1, at
which point additional layers cannot meaningfully be considered convolutional. Another
special case of the zero-padding setting is when just enough zero-padding is added to
keep the size of the output equal to the size of the input. MATLAB calls this same
convolution. In this case, the network can contain as many convolutional layers as the
available hardware can support, since the operation of convolution does not modify the
architectural possibilities available to the next layer. However, the input pixels near the
border influence fewer output pixels than the input pixels near the center. This can
make the border pixels somewhat underrepresented in the model. This motivates the
other extreme case, which MATLAB refers to as full convolution, in which enough zeroes
are added for every pixel to be visited k times in each direction, resulting in an output
image of size (m + k āˆ’ 1) Ɨ (m + k āˆ’ 1). In this case, the output pixels near the border are a
function of fewer pixels than the output pixels near the center. This can make it difficult
to learn a single kernel that performs well at all positions in the convolutional feature
map. Usually the optimal amount of zero padding (in terms of test set classification
accuracy) lies somewhere between ā€œvalidā€ and ā€œsameā€ convolution.
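The three padding regimes are easy to see in one dimension, where np.convolve happens to use the same names (the sizes below assume a 16-pixel input and a kernel of width 6, as in Fig. 11.9):

```python
import numpy as np

x = np.random.randn(16)   # m = 16 input "pixels"
w = np.random.randn(6)    # kernel of width k = 6

print(len(np.convolve(x, w, mode="valid")))  # 11 = m - k + 1  (no padding)
print(len(np.convolve(x, w, mode="same")))   # 16 = m          (output size preserved)
print(len(np.convolve(x, w, mode="full")))   # 21 = m + k - 1  (maximal padding)
```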
In some cases, we do not actually want to use convolution, but rather locally con-
nected layers. In this case, the adjacency matrix in the graph of our MLP is the same,
but every connection has its own weight, specified by a 6-D array W, with element
w_{i,j,k,l,m,n} giving the weight connecting the output unit in channel i at row j and
column k to the input unit in channel l at row j + m and column k + n:
\[ z_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\, j+m,\, k+n} \, w_{i,j,k,l,m,n} \right] . \]
Figure 11.9: The effect of zero padding on network size: Consider a convolutional net-
work with a kernel of width six at every layer. In this example, we do not use any pooling,
so only the convolution operation itself shrinks the network size. Top) In this convolu-
tional network, we do not use any implicit zero padding. This causes the representation
to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are
only able to have three convolutional layers, and the last layer does not ever move the
kernel, so arguably only two of the layers are truly convolutional. The rate of shrink-
ing can be mitigated by using smaller kernels, but smaller kernels are less expressive
and some shrinking is inevitable in this kind of architecture. Bottom) By adding five
implicit zeroes to each layer, we prevent the representation from shrinking with depth.
This allows us to make an arbitrarily deep convolutional network.
Locally connected layers are useful when we know that each feature should be a
function of a small part of space, but there is no reason to think that the same feature
should occur across all of space. For example, if we want to tell if an image is a picture
of a face, we only need to look for the mouth in the bottom half of the image.
It can also be useful to make versions of convolution or locally connected layers in
which the connectivity is further restricted, for example by constraining each output
channel i to be a function of only a subset of the input channels l.
Tiled convolution offers a compromise between a convolutional layer and a locally
connected layer. We make K a 6-D tensor, where two of the dimensions correspond
to different locations in the output map. Rather than having a separate index for each
location in the output map, output locations cycle through a set of t different choices
of kernel stack in each direction. If t is equal to the output width, this is the same as a
locally connected layer.
\[ z_{i,j,k} = \sum_{l,m,n} v_{l,\, j+m,\, k+n} \, k_{i,l,m,n,\, j \% t,\, k \% t} . \]
Note that it is straightforward to generalize this equation to use a different tiling
range for each dimension.
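A 1-D analogue may make the indexing clearer. In the sketch below (shapes and names are illustrative), output position j uses the kernel indexed by j % t; t = 1 gives an ordinary (unflipped) convolutional layer, while t equal to the output width gives a locally connected layer:

```python
import numpy as np

def tiled_conv_1d(V, K, t):
    # V: (in_channels, width);  K: (out_channels, in_channels, kernel_width, t)
    out_ch, in_ch, kw, _ = K.shape
    out_width = V.shape[1] - kw + 1
    Z = np.zeros((out_ch, out_width))
    for i in range(out_ch):
        for j in range(out_width):
            for l in range(in_ch):
                for m in range(kw):
                    Z[i, j] += V[l, j + m] * K[i, l, m, j % t]
    return Z

V = np.random.randn(2, 12)
K = np.random.randn(3, 2, 3, 2)        # t = 2: two kernel stacks alternate
print(tiled_conv_1d(V, K, t=2).shape)  # (3, 10)
```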
Both locally connected layers and tiled convolutional layers have an interesting inter-
action with max-pooling: the detector units of these layers are driven by different filters.
If these filters learn to detect different transformed versions of the same underlying fea-
tures, then the max-pooled units become invariant to the learned transformation (see
Fig. 11.7). Convolutional layers are hard-coded to be invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a convolu-
tional network. To perform learning, one must be able to compute the gradient with
respect to the kernel, given the gradient with respect to the outputs. In some simple
cases, this operation can be performed using the convolution operation, but many cases
of interest, including the case of stride greater than 1, do not have this property.
Convolution is a linear operation and can thus be described as a matrix multiplication
(if we first reshape the input array into a flat vector). The matrix involved is a function
of the convolution kernel. The matrix is sparse and each element of the kernel is copied
to several elements of the matrix. It is not usually practical to implement convolution
in this way, but the matrix view is conceptually useful.
Multiplication by the transpose of the matrix defined by convolution is also a useful
operation. This is the operation needed to backpropagate error derivatives through a
convolutional layer, so it is needed to train convolutional networks that have more than
one hidden layer. This same operation is also needed to compute the reconstruction in
a convolutional autoencoder (or to perform the analogous role in a convolutional RBM,
sparse coding model, etc.). Like the kernel gradient operation, this input gradient
operation can be implemented using a convolution in some cases, but in the general
case requires a third operation to be implemented. Care must be taken to coordinate
this transpose operation with the forward propagation. The size of the output that the
transpose operation should return depends on the zero padding policy and stride of the
forward propagation operation, as well as the size of the forward propagation’s output
map. In some cases, multiple sizes of input to forward propagation can result in the
same size of output map, so the transpose operation must be explicitly told what the
size of the original input was.
It turns out that these three operations–convolution, backprop from output to weights,
and backprop from output to inputs–are sufficient to compute all of the gradients needed
to train any depth of feedforward convolutional network, as well as to train convolu-
tional networks with reconstruction functions based on the transpose of convolution.
See (Goodfellow, 2010) for a full derivation of the equations in the fully general multi-
dimensional, multi-example case. To give a sense of how these equations work, we
present the two dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution
of a kernel stack K applied to a multi-channel image V with stride s, defined
by c(K, V, s) as in equation 11.1. Suppose we want to minimize some loss function
J(V, K). During forward propagation, we will need to use c itself to compute the output Z, which
is then propagated through the rest of the network and used to compute J. During
backpropagation, we will receive an array G such that $G_{i,j,k} = \frac{\partial}{\partial z_{i,j,k}} J(V, K)$.
To train the network, we need to compute the derivatives with respect to the weights
in the kernel. To do so, we can use a function
\[ g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial k_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n} \, v_{j,\, m \times s + k,\, n \times s + l} \, . \]
If this layer is not the bottom layer of the network, we’ll need to compute the gradient
with respect to V in order to backpropagate the error farther down. To do so, we can
use a function
\[ h(K, G, s)_{i,j,k} = \frac{\partial}{\partial v_{i,j,k}} J(V, K) = \sum_{\substack{l,m \text{ s.t.} \\ s \times l + m = j}} \; \sum_{\substack{n,p \text{ s.t.} \\ s \times n + p = k}} \; \sum_{q} k_{q,i,m,p} \, G_{q,l,n} \, . \]
We could also use h to define the reconstruction of a convolutional autoencoder, or
the probability distribution over visible given hidden units in a convolutional RBM or
sparse coding model. Suppose we have hidden units H in the same format as Z and we
define a reconstruction
R = h(K, H, s).
In order to train the autoencoder, we will receive the gradient with respect to R as
an array E. To train the decoder, we need to obtain the gradient with respect to K.
This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with
respect to H. This is given by c(K, E, s). It is also possible to differentiate through g
using c and h, but these operations are not needed for the backpropagation algorithm
on any standard network architectures.
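As a sanity check on the kernel-gradient formula, the sketch below implements c and g with plain loops (the shapes and the toy loss are made up for the test) and compares one entry of g against a finite-difference estimate; a full implementation would also provide h, which follows the same pattern.

```python
import numpy as np

def c(K, V, s):
    # Strided convolution of equation 11.1 (no kernel flipping).
    out_ch, in_ch, kh, kw = K.shape
    out_r = (V.shape[1] - kh) // s + 1
    out_c = (V.shape[2] - kw) // s + 1
    Z = np.zeros((out_ch, out_r, out_c))
    for j in range(out_r):
        for k_ in range(out_c):
            patch = V[:, j * s:j * s + kh, k_ * s:k_ * s + kw]
            Z[:, j, k_] = np.tensordot(K, patch, axes=3)  # sum over l, m, n
    return Z

def g(G, V, s, kernel_shape):
    # g(G, V, s)_{i,j,k,l} = sum_{m,n} G_{i,m,n} V_{j, m*s+k, n*s+l}
    out_ch, in_ch, kh, kw = kernel_shape
    dK = np.zeros(kernel_shape)
    out_r, out_c = G.shape[1], G.shape[2]
    for i in range(out_ch):
        for j in range(in_ch):
            for k_ in range(kh):
                for l in range(kw):
                    for m in range(out_r):
                        for n in range(out_c):
                            dK[i, j, k_, l] += G[i, m, n] * V[j, m * s + k_, n * s + l]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 7, 7))
K = rng.standard_normal((3, 2, 3, 3))
s = 2
G = rng.standard_normal(c(K, V, s).shape)        # pretend this came from backprop
J = lambda K_: np.sum(c(K_, V, s) * G)           # toy loss whose dJ/dZ is exactly G

dK = g(G, V, s, K.shape)
eps, idx = 1e-6, (1, 0, 2, 1)                    # probe one kernel entry
K_plus = K.copy(); K_plus[idx] += eps
print(np.isclose(dK[idx], (J(K_plus) - J(K)) / eps, atol=1e-4))  # True
```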
Generally, we do not use only a linear operation in order to transform from the
inputs to the outputs in a convolutional layer. We generally also add some bias term to
each output before applying the nonlinearity. This raises the question of how to share
parameters among the biases. For locally connected layers it is natural to give each
unit its own bias, and for tiled convolution, it is natural to share the biases with the
same tiling pattern as the kernels. For convolutional layers, it is typical to have one
bias per channel of the output and share it across all locations within each convolution
map. However, if the input is of known, fixed size, it is also possible to learn a separate
bias at each location of the output map. Separating the biases may slightly reduce the
statistical efficiency of the model, but also allows the model to correct for differences
in the image statistics at different locations. For example, when using implicit zero
padding, detector units at the edge of the image receive less total input and may need
larger biases.
11.5 Data types
The data used with a convolutional network usually consists of several channels, each
channel being the observation of a different quantity at some point in space or time.
See Table 11.1 for examples of data types with different dimensionalities and number of
channels.
So far we have discussed only the case where every example in the train and test
data has the same spatial dimensions. One advantage to convolutional networks is that
they can also process inputs with varying spatial extents. These kinds of input simply
cannot be represented by traditional, matrix multiplication-based neural networks. This
provides a compelling reason to use convolutional networks even when computational
cost and overfitting are not significant issues.
For example, consider a collection of images, where each image has a different width
and height. It is unclear how to apply matrix multiplication. Convolution is straight-
forward to apply; the kernel is simply applied a different number of times depending on
the size of the input, and the output of the convolution operation scales accordingly.
Sometimes the output of the network is allowed to have variable size as well as the
input, for example if we want to assign a class label to each pixel of the input. In this
case, no further design work is necessary. In other cases, the network must produce
some fixed-size output, for example if we want to assign a single class label to the entire
image. In this case we must make some additional design steps, like inserting a pooling
layer whose pooling regions scale in size proportional to the size of the input, in order
to maintain a fixed number of pooled outputs.
Note that the use of convolution for processing variable sized inputs only makes sense
for inputs that have variable size because they contain varying amounts of observation
of the same kind of thing–different lengths of recordings over time, different widths of
observations over space, etc. Convolution does not make sense if the input has variable
size because it can optionally include different kinds of observations.
1-D, single channel. Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multi-channel. Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a ā€œskeletonā€ over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier transform: We can transform the audio waveform into a 2-D array with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation, but at a different height in the network’s output.

2-D, multi-channel. Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel. Volumetric data: A common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
Table 11.1: Examples of different formats of data that can be used with convolutional
networks.
For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it
does not make sense to convolve the same weights over both the features corresponding
to the grades and the features corresponding to the test scores.
11.6 Efficient convolution algorithms
Modern convolutional network applications often involve networks containing more than
one million units. Powerful implementations exploiting parallel computation resources
are essential. TODO: add reference to efficient implementations section of applications
chapter. However, it is also possible to obtain an algorithmic speedup in many cases.
Convolution is equivalent to converting both the input and the kernel to the fre-
quency domain using a Fourier transform, performing point-wise multiplication of the
two signals, and converting back to the time domain using an inverse Fourier transform.
For some problem sizes, this can be faster than the naive implementation of discrete
convolution.
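A quick numerical check of this equivalence in one dimension (sizes arbitrary): zero-pad both signals to the full output length, multiply their Fourier transforms pointwise, and invert.

```python
import numpy as np

x = np.random.randn(256)
w = np.random.randn(16)

direct = np.convolve(x, w, mode="full")

n = len(x) + len(w) - 1                      # length of the full convolution
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

print(np.allclose(direct, fft_based))        # True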
When a d-dimensional kernel can be expressed as the outer product of d vectors, one
vector per dimension, the kernel is called separable. When the kernel is separable, naive
convolution is inefficient. It is equivalent to composing d one-dimensional convolutions,
one with each of these vectors. The composed approach is significantly faster than perform-
ing one d-dimensional convolution with their outer product. The kernel also takes fewer
parameters to represent as vectors. If the kernel is w elements wide in each dimension,
then naive multidimensional convolution requires O(w^d) runtime and parameter storage
space, while separable convolution requires O(w Ɨ d) runtime and parameter storage
space. Of course, not every convolution can be represented in this way.
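The sketch below (using SciPy's convolve2d purely as a reference implementation, with a Sobel-like kernel as the separable example; SciPy is an added assumption) confirms that composing two 1-D convolutions reproduces the 2-D convolution with the outer-product kernel:

```python
import numpy as np
from scipy.signal import convolve2d   # reference implementation (requires SciPy)

u = np.array([1.0, 2.0, 1.0])
v = np.array([1.0, 0.0, -1.0])
K = np.outer(u, v)                    # separable 3x3 kernel (w = 3, d = 2)

I = np.random.randn(32, 32)

two_dimensional = convolve2d(I, K, mode="valid")
# Compose d = 2 one-dimensional convolutions: along each row with v,
# then along each column with u.
composed = convolve2d(convolve2d(I, v[np.newaxis, :], mode="valid"),
                      u[:, np.newaxis], mode="valid")

print(np.allclose(two_dimensional, composed))  # True
```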
Devising faster ways of performing convolution or approximate convolution without
harming the accuracy of the model is an active area of research. Even techniques that
improve the efficiency of only forward propagation are useful because in the commercial
setting, it is typical to devote many more resources to deployment of a network than to
its training.
11.7 Deep learning history
TODO: deep learning history, conv nets were first really successful deep net
TODO: figure out where in the book to put a section on Pooling and Linear-by-Part Non-Linearities: Drednets and Maxout