convolution, if we let g be any function that translates the input, i.e., shifts it, then
the convolution function is equivariant to g. For example, define g(x) such that for
all i, g(x)[i] = x[i − 1]. This shifts every element of x one unit to the right. If we
apply this transformation to x, then apply convolution, the result will be the same as
if we applied convolution to x, then applied the transformation to the output. When
processing time series data, this means that convolution produces a sort of timeline that
shows when different features appear in the input. If we move an event later in time
in the input, the exact same representation of it will appear in the output, just later in
time. Similarly with images, convolution creates a 2-D map of where certain features
appear in the input. If we move the object in the input, its representation will move the
same amount in the output. This is useful when we know that the same local function
is useful everywhere in the input. For example, when processing images, it is useful to
detect edges in the first layer of a convolutional network, and an edge looks the same
regardless of where it appears in the image. This property is not always useful. For
example, if we want to recognize a face, some portion of the network needs to vary with
spatial location, because the top of a face does not look the same as the bottom of a
face–the part of the network processing the top of the face needs to look for eyebrows,
while the part of the network processing the bottom of the face needs to look for a chin.
Note that convolution is not equivariant to some other transformations, such as
changes in the scale or rotation of an image. Other mechanisms are necessary for
handling these kinds of transformations.
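To see this property concretely, the short NumPy sketch below (the signal and kernel values are arbitrary, chosen only for illustration and not taken from the text) checks that convolving a shifted input equals shifting the convolved output. A circular (wrap-around) shift and circular convolution are used so that the equality is exact at the edges; with ordinary zero-padded convolution it holds up to boundary effects.

import numpy as np

def circular_conv1d(x, k):
    # Circular (wrap-around) 1-D convolution, so that shifts commute exactly.
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        for j, kj in enumerate(k):
            out[i] += kj * x[(i - j) % n]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # illustrative 1-D signal
k = np.array([1.0, -2.0, 1.0])    # illustrative kernel

shift = lambda v: np.roll(v, 1)   # g(x)[i] = x[i - 1], wrapping at the edge

# Equivariance: convolving the shifted input equals shifting the convolved output.
assert np.allclose(circular_conv1d(shift(x), k),
                   shift(circular_conv1d(x, k)))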
We can think of the use of convolution as introducing an infinitely strong prior
probability distribution over the parameters of a layer. This prior says that the function
the layer should learn contains only local interactions and is equivariant to translation.
This view of convolution as an infinitely strong prior makes it clear that the efficiency
improvements of convolution come with a caveat: convolution is only applicable when
the assumptions made by this prior are approximately correct. The use of convolution constrains
the class of functions that the layer can represent. If the function that a layer needs to
learn indeed involves only local interactions and is equivariant to translation, then the layer will be dramatically
more efficient if it uses convolution rather than matrix multiplication. If the necessary
function does not have these properties, then using a convolutional layer will cause the
model to have high training error.
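As a rough illustration of this constraint (an illustrative sketch with made-up sizes, not a construction from the text), the NumPy code below builds the dense matrix that a small "valid" 1-D convolution implicitly multiplies by: every row contains the same few kernel weights, shifted over by one position, and zeros elsewhere. The convolution has only as many free parameters as the kernel, whereas an unconstrained layer has one parameter per matrix entry.

import numpy as np

def conv_as_matrix(kernel, n_in):
    # Dense matrix equivalent of a 'valid' 1-D convolution: each row holds the
    # same kernel weights (flipped, since convolution reverses the kernel),
    # shifted by one position, with zeros everywhere else.
    k = len(kernel)
    n_out = n_in - k + 1
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        W[i, i:i + k] = kernel[::-1]
    return W

n_in = 10
kernel = np.array([1.0, -2.0, 1.0])                 # 3 free parameters
x = np.random.default_rng(1).normal(size=n_in)

W = conv_as_matrix(kernel, n_in)
assert np.allclose(W @ x, np.convolve(x, kernel, mode='valid'))

print("free parameters, convolution:", kernel.size)  # 3
print("entries in the dense matrix: ", W.size)       # 80

Sharing the same kernel weights across every row and forcing everything outside the local band to zero is exactly the infinitely strong prior described above.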
Finally, some kinds of data cannot be processed by neural networks defined by matrix
multiplication with a fixed-shape matrix. Convolution enables processing of some of
these kinds of data. We discuss this further in section 9.5.
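As a quick illustration of this point (again a sketch with arbitrary values, not from the text), the snippet below applies the same three-parameter kernel to inputs of several different lengths; a layer defined by a fixed-shape weight matrix would accept only one of these input sizes.

import numpy as np

kernel = np.array([0.25, 0.5, 0.25])   # illustrative smoothing kernel

# The same three convolution parameters apply to an input of any length,
# whereas a fixed-shape weight matrix accepts inputs of only one size.
for n in (5, 12, 30):
    x = np.random.default_rng(n).normal(size=n)
    y = np.convolve(x, kernel, mode='valid')
    print(f"input length {n:2d} -> output length {len(y)}")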
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.5). In
the first stage, the layer performs several convolutions in parallel to produce a set of
linear activations. In the second stage, each linear activation is run through a
nonlinear activation function, such as the rectified linear activation function. This stage
is sometimes called the detector stage. In the third stage, we use a pooling function to