Chapter 9
Convolutional Networks
Convolutional networks (also known as convolutional neural networks or CNNs)
are a specialized kind of neural network for processing data that has a known,
grid-like topology. Examples include time-series data, which can be thought of
as a 1D grid taking samples at regular time intervals, and image data, which
can be thought of as a 2D grid of pixels. Convolutional networks have been
tremendously successful in practical applications; see section 13.1.2 for details.
The name “convolutional neural network” indicates that the network employs
a mathematical operation called convolution. Convolution is a specialized kind
of linear operation. Convolutional networks are simply neural networks
that use convolution in place of general matrix multiplication.
In this chapter, we will first describe what convolution is. Next, we will
explain the motivation behind using convolution in a neural network. We will
then describe an operation called pooling, which almost all convolutional networks
employ. Usually, the operation used in a convolutional neural network does not
correspond precisely to the definition of convolution as used in other fields such
as engineering or pure mathematics. We will describe several variants on the
convolution function that are widely used in practice for neural networks. We
will also show how convolution may be applied to many kinds of data, with
different numbers of dimensions. We then discuss means of making convolution
more efficient. We conclude with comments about the role convolutional networks
have played in the history of deep learning.
9.1 The Convolution Operation
In its most general form, convolution is an operation on two functions of a real-
valued argument. To motivate the definition of convolution, let’s start with ex-
amples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time t. Both
x and t are real-valued, i.e., we can get a different reading from the laser sensor
at any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy
estimate of the spaceship’s position, we would like to average together several
measurements. Of course, more recent measurements are more relevant, so we
will want this to be a weighted average that gives more weight to recent measure-
ments. We can do this with a weighting function w(a), where a is the age of a
measurement. If we apply such a weighted average operation at every moment,
we obtain a new function s providing a smoothed estimate of the position of the
spaceship:
    s(t) = \int x(a)\, w(t - a)\, da.
This operation is called convolution. The convolution operation is typically
denoted with an asterisk:
    s(t) = (x * w)(t).
In our example, w needs to be a valid probability density function, or the
output is not a weighted average. Also, w needs to be 0 for all negative argu-
ments, or it will look into the future, which is presumably beyond our capabilities.
These limitations are particular to our example though. In general, convolution
is defined for any functions for which the above integral is defined, and may be
used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example,
the function x) to the convolution is often referred to as the input and the second
argument (in this example, the function w) as the kernel. The output is sometimes
referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at
every instant in time is not realistic. Usually, when we work with data on a
computer, time will be discretized, and our sensor will provide data at regular
intervals. In our example, it might be more realistic to assume that our laser
provides one measurement per second. The time index t can then take on only integer
values. If we now assume that x and w are defined only on integer t, we can
define the discrete convolution:
    s[t] = (x * w)[t] = \sum_{a=-\infty}^{\infty} x[a]\, w[t - a].
In machine learning applications, the input is usually an array of data and
the kernel is usually an array of learnable parameters. Because each element of
the input and kernel must be explicitly stored separately, we usually assume that
these functions are zero everywhere but the finite set of points for which we store
the values. This means that in practice we can implement the infinite summation
as a summation over a finite number of array elements.
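To make the finite summation concrete, here is a minimal NumPy sketch (ours, not part of the original text; the sensor readings and weights are made up) that implements the discrete convolution above, treating the input and kernel as zero outside the stored entries:

    import numpy as np

    def conv1d(x, w):
        # Discrete convolution s[t] = sum_a x[a] w[t - a] for finite arrays.
        # x and w are assumed to be zero outside the stored entries, so the
        # infinite sum reduces to a finite one over valid indices.
        n, k = len(x), len(w)
        s = np.zeros(n + k - 1)          # length of the "full" output
        for t in range(len(s)):
            for a in range(n):
                if 0 <= t - a < k:       # only terms where w[t - a] is stored
                    s[t] += x[a] * w[t - a]
        return s

    x = np.array([0., 1., 2., 3.])       # hypothetical sensor readings
    w = np.array([0.6, 0.3, 0.1])        # weights favoring recent measurements
    print(conv1d(x, w))                  # agrees with np.convolve(x, w)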
Finally, we often use convolutions over more than one axis at a time. For
example, if we use a two-dimensional image I as our input, we probably also
want to use a two-dimensional kernel K:
    s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[m, n]\, K[i - m, j - n].
Note that convolution is commutative, meaning we can equivalently write:
    s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[i - m, j - n]\, K[m, n].
Usually the latter view is more straightforward to implement in a machine
learning library, because there is less variation in the range of valid values of m
and n.
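As a sketch of how the second form translates into code (our own illustration; the function names are hypothetical), the loops below compute a valid 2-D convolution with kernel flipping, followed by the cross-correlation variant that most deep learning libraries implement under the name "convolution":

    import numpy as np

    def conv2d_flipped(I, K):
        # s[i, j] = sum_{m,n} I[i - m, j - n] K[m, n], restricted to the "valid"
        # region where every index into I exists. Looping over the kernel indices
        # m, n keeps their range fixed, which is why this form is easier to implement.
        H, W = I.shape
        kH, kW = K.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                for m in range(kH):
                    for n in range(kW):
                        out[i, j] += I[i + kH - 1 - m, j + kW - 1 - n] * K[m, n]
        return out

    def cross_correlate2d(I, K):
        # The same loops without flipping the kernel; this is what most deep
        # learning libraries call "convolution".
        H, W = I.shape
        kH, kW = K.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K)
        return out

    I = np.random.rand(6, 6)
    K = np.random.rand(3, 3)
    # Cross-correlating with the flipped kernel gives the same result as convolution.
    print(np.allclose(conv2d_flipped(I, K), cross_correlate2d(I, K[::-1, ::-1])))   # True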
Discrete convolution can be viewed as multiplication by a matrix. However,
the matrix has several entries constrained to be equal to other entries. For exam-
ple, for univariate discrete convolution, each row of the matrix is constrained to be
equal to the row above shifted by one element. This is known as a Toeplitz matrix.
In two dimensions, a doubly block circulant matrix corresponds to convolution. In
addition to these constraints that several elements be equal to each other, con-
volution usually corresponds to a very sparse matrix (a matrix whose entries are
mostly equal to zero). This is because the kernel is usually much smaller than
the input image. Viewing convolution as matrix multiplication usually does not
help to implement convolution operations, but it is useful for understanding and
designing neural networks. Any neural network algorithm that works with matrix
multiplication and does not depend on specific properties of the matrix structure
should work with convolution, without requiring any further changes to the neural
network. Typical convolutional neural networks do make use of further special-
izations in order to deal with large inputs efficiently, but these are not strictly
necessary from a theoretical perspective.
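As a small illustration of this matrix view (ours), the matrix for a 1-D valid convolution can be built explicitly and checked against the direct computation; each row is the previous row shifted by one element, and most entries are zero:

    import numpy as np

    def conv_matrix(w, n):
        # Matrix C such that C @ x equals the valid 1-D convolution of x with w.
        # The rows are shifted copies of each other (a Toeplitz structure), and
        # most entries are zero because the kernel is much smaller than the input.
        k = len(w)
        C = np.zeros((n - k + 1, n))
        for i in range(n - k + 1):
            C[i, i:i + k] = w[::-1]      # reversed kernel: convolution, not cross-correlation
        return C

    x = np.arange(6.0)
    w = np.array([1.0, -2.0, 0.5])
    C = conv_matrix(w, len(x))
    print(C @ x)
    print(np.convolve(x, w, mode="valid"))   # same result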
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine
learning system: sparse interactions, parameter sharing, and equivariant repre-
Figure 9.1: Sparse connectivity, viewed from below: We highlight one input unit, x_3, and
also highlight the output units in s that are affected by this unit. (Left) When s is
formed by convolution with a kernel of width 3, only three outputs are affected by x_3.
(Right) When s is formed by matrix multiplication, connectivity is no longer sparse, so
all of the outputs are affected by x_3.
sentations. Moreover, convolution provides a means for working with inputs of
variable size. We now describe each of these ideas in turn.
Traditional neural network layers use a matrix multiplication to describe the
interaction between each input unit and each output unit. This means every
output unit interacts with every input unit. Convolutional networks, however,
typically have sparse interactions (also referred to as sparse connectivity or sparse
weights). This is accomplished by making the kernel smaller than the input. For
example, when processing an image, the input image might have thousands or
millions of pixels, but we can detect small, meaningful features such as edges
with kernels that occupy only tens or hundreds of pixels. This means that we
need to store fewer parameters, which both reduces the memory requirements of
the model and improves its statistical efficiency. It also means that computing the
output requires fewer operations. These improvements in efficiency are usually
quite large. If there are m inputs and n outputs, then matrix multiplication
requires m × n parameters and the algorithms used in practice have O(m × n)
runtime (per example). If we limit the number of connections each output may
have to k, then the sparsely connected approach requires only k × n parameters
and O(k × n) runtime. For many practical applications, it is possible to obtain
good performance on the machine learning task while keeping k several orders of
magnitude smaller than m. For graphical demonstrations of sparse connectivity,
see Fig. 9.1 and Fig. 9.2. In a deep convolutional network, units in the deeper
layers may indirectly interact with a larger portion of the input, as shown in
Fig. 9.3. This allows the network to efficiently describe complicated interactions
between many variables by constructing such interactions from simple building
blocks that each describe only sparse interactions.
Parameter sharing refers to using the same parameter for more than one func-
tion in a model. In a traditional neural net, each element of the weight matrix
is used exactly once when computing the output of a layer. It is multiplied by
one element of the input, and then never revisited. As a synonym for parameter
Figure 9.2: Sparse connectivity, viewed from above: We highlight one output unit, s_3,
and also highlight the input units in x that affect this unit. These units are known as the
receptive field of s_3. (Left) When s is formed by convolution with a kernel of width 3, only
three inputs affect s_3. (Right) When s is formed by matrix multiplication, connectivity
is no longer sparse, so all of the inputs affect s_3.
sharing, one can say that a network has tied weights, because the value of the
weight applied to one input is tied to the value of a weight applied elsewhere. In
a convolutional neural net, each member of the kernel is used at every position of
the input (except perhaps some of the boundary pixels, depending on the design
decisions regarding the boundary). The parameter sharing used by the convolu-
tion operation means that rather than learning a separate set of parameters for
every location, we learn only one set. This does not affect the runtime of forward
propagation, which is still O(k × n), but it does further reduce the storage requirements
of the model to k parameters. Recall that k is usually several orders of magnitude
smaller than m. Since m and n are usually roughly the same size, k is practically
insignificant compared to m × n. Convolution is thus dramatically more efficient
than dense matrix multiplication in terms of the memory requirements and sta-
tistical efficiency. For a graphical depiction of how parameter sharing works, see
Fig. 9.4.
As an example of both of these first two principles in action, Fig. 9.5 shows
how sparse connectivity and parameter sharing can dramatically improve the
efficiency of a linear function for detecting edges in an image.
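The arithmetic behind Fig. 9.5 can be reproduced directly; the sketch below is ours, with a random array standing in for the photograph:

    import numpy as np

    H, W = 280, 320
    image = np.random.rand(H, W)              # stand-in for the 280 x 320 input photo

    # Vertical edge detector: each output pixel is a pixel minus its left neighbor,
    # i.e. a convolution with a 2-element kernel along each row.
    edges = image[:, 1:] - image[:, :-1]      # output is 280 x 319

    conv_flops = 319 * 280 * 3                # 267,960 operations, as in the caption of Fig. 9.5
    dense_entries = 320 * 280 * 319 * 280     # over 8 billion matrix entries
    print(dense_entries // 2)                 # roughly 4 billion times more entries than the 2-element kernel
    print((2 * dense_entries) // conv_flops)  # roughly 60,000 times more floating point operations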
In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation. To say a function is
equivariant means that if the input changes, the output changes in the same way.
Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In
the case of convolution, if we let g be any function that translates the input, i.e.,
shifts it, then the convolution function is equivariant to g. For example, define
g(x) such that for all i, g(x)[i] = x[i − 1]. This shifts every element of x one
Figure 9.3: The receptive field of the units in the deeper layers of a convolutional network
is larger than the receptive field of the units in the shallow layers. This effect increases
if the network includes architectural features like strided convolution or pooling. This
means that even though direct connections in a convolutional net are very sparse, units
in the deeper layers can be indirectly connected to all or most of the input image.
Figure 9.4: Parameter sharing: We highlight the connections that use a particular pa-
rameter in two different models. (Left) We highlight uses of the central element of a
3-element kernel in a convolutional model. Due to parameter sharing, this single param-
eter is used at all input locations. (Right) We highlight the use of the central element of
the weight matrix in a fully connected model. This model has no parameter sharing so
the parameter is used only once.
unit to the right. If we apply this transformation to x, then apply convolution,
the result will be the same as if we applied convolution to x, then applied the
transformation to the output. When processing time series data, this means that
convolution produces a sort of timeline that shows when different features appear
in the input. If we move an event later in time in the input, the exact same
representation of it will appear in the output, just later in time. Similarly with
images, convolution creates a 2-D map of where certain features appear in the
input. If we move the object in the input, its representation will move the same
amount in the output. This is useful when we know that the same local function
is useful everywhere in the input. For example, when processing images, it is
useful to detect edges in the first layer of a convolutional network, and an edge
looks the same regardless of where it appears in the image. This property is not
always useful. For example, if we want to recognize a face, some portion of the
network needs to vary with spatial location, because the top of a face does not
look the same as the bottom of a face: the part of the network processing the top
of the face needs to look for eyebrows, while the part of the network processing
the bottom of the face needs to look for a chin.
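A quick numerical check of this translation equivariance (our own sketch; a shift by one position with zero fill is just one convenient choice of the translation g):

    import numpy as np

    x = np.array([0., 0., 1., 0., 0., 0.])
    w = np.array([0.5, 0.25])

    def shift_right(v):
        # g(x)[i] = x[i - 1]: shift one step to the right, filling with zero.
        out = np.zeros_like(v)
        out[1:] = v[:-1]
        return out

    a = np.convolve(shift_right(x), w)[:len(x)]   # translate, then convolve
    b = shift_right(np.convolve(x, w)[:len(x)])   # convolve, then translate
    print(np.allclose(a, b))                      # True: equivariance to translation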
Note that convolution is not equivariant to some other transformations, such
as changes in the scale or rotation of an image. Other mechanisms are necessary
for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by
matrix multiplication with a fixed-shape matrix. Convolution enables processing
of some of these kinds of data. We discuss this further in section 9.6.
Figure 9.5: Efficiency of edge detection. The image on the right was formed by taking
each pixel in the original image and subtracting the value of its neighboring pixel on
the left. This shows the strength of all of the vertically oriented edges in the input
image, which can be a useful operation for object detection. Both images are 280 pixels
tall. The input image is 320 pixels wide while the output image is 319 pixels wide.
This transformation can be described by a convolution kernel containing 2 elements, and
requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one
addition per output pixel) to compute. To describe the same transformation with a matrix
multiplication would take 320 × 280 × 319 × 280, or over 8 billion, entries in the matrix,
making convolution 4 billion times more efficient for representing this transformation. The
straightforward matrix multiplication algorithm performs over 16 billion floating point
operations, making convolution roughly 60,000 times more efficient computationally. Of
course, most of the entries of the matrix would be zero. If we stored only the nonzero
entries of the matrix, then both matrix multiplication and convolution would require the
same number of floating point operations to compute. The matrix would still need to
contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of
describing transformations that apply the same linear transformation of a small, local
region across the entire input. (Photo credit: Paula Goodfellow)
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.6).
In the first stage, the layer performs several convolutions in parallel to produce a
set of presynaptic activations. In the second stage, each presynaptic activation is
run through a nonlinear activation function, such as the rectified linear activation
function. This stage is sometimes called the detector stage. In the third stage,
we use a pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a
summary statistic of the nearby outputs. For example, the max pooling operation
reports the maximum output within a rectangular neighborhood. Other popular
pooling functions include the average of a rectangular neighborhood, the L2 norm
of a rectangular neighborhood, or a weighted average based on the distance from
the central pixel.
In all cases, pooling helps to make the representation become invariant to
small translations of the input. This means that if we translate the input by
a small amount, the values of most of the pooled outputs do not change. See
Fig. 9.7 for an example of how this works. Invariance to local translation
can be a very useful property if we care more about whether some
feature is present than exactly where it is. For example, when determining
whether an image contains a face, we need not know the location of the eyes with
pixel-perfect accuracy; we just need to know that there is an eye on the left side
of the face and an eye on the right side of the face. In other contexts, it is more
important to preserve the location of a feature. For example, if we want to find a
corner defined by two edges meeting at a specific orientation, we need to preserve
the location of the edges well enough to test whether they meet.
The use of pooling can be viewed as adding an infinitely strong prior that
the function the layer learns must be invariant to small translations. When this
assumption is correct, it can greatly improve the statistical efficiency of the net-
work.
Pooling over spatial regions produces invariance to translation, but if we pool
over the outputs of separately parametrized convolutions, the features can learn
which transformations to become invariant to (see Fig. 9.8).
Because pooling summarizes the responses over a whole neighborhood, it is
possible to use fewer pooling units than detector units, by reporting summary
statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See
Fig. 9.9 for an example. This improves the computational efficiency of the network
because the next layer has roughly k times fewer inputs to process. When the
number of parameters in the next layer is a function of its input size (such as
when the next layer is fully connected and based on matrix multiplication) this
reduction in the input size can also result in improved statistical efficiency and
Figure 9.6: The components of a typical convolutional neural network layer. There are two
commonly used sets of terminology for describing these layers. Left) In this terminology,
the convolutional net is viewed as a small number of relatively complex layers, with each
layer having many “stages.” In this terminology, there is a one-to-one mapping between
kernel tensors and network layers. In this book we generally use this terminology. Right)
In this terminology, the convolutional net is viewed as a larger number of simple layers;
every step of processing is regarded as a layer in its own right. This means that not every
“layer” has parameters.
Figure 9.7: Max pooling introduces invariance. Left: A view of the middle of the output
of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top
row shows the outputs of max pooling, with a stride of 1 between pooling regions and a
pooling region width of 3. Right: A view of the same network, after the input has been
shifted to the right by 1 pixel. Every value in the bottom row has changed, but only
half of the values in the top row have changed, because the max pooling units are only
sensitive to the maximum value in the neighborhood, not its exact location.
Figure 9.8: Example of learned invariances: If each of these filters drives units that appear
in the same max-pooling region, then the pooling unit will detect “5”s in any rotation.
By learning to have each filter be a different rotation of the “5” template, this pooling
unit has learned to be invariant to rotation. This is in contrast to translation invariance,
which is usually achieved by hard-coding the net to pool over shifted versions of a single
learned filter.
Figure 9.9: Pooling with downsampling. Here we use max-pooling with a pool width of
3 and a stride between pools of 2. This reduces the representation size by a factor of 2,
which reduces the computational and statistical burden on the next layer. Note that the
final pool has a smaller size, but must be included if we do not want to ignore some of
the detector units.
reduced memory requirements for storing the parameters.
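A minimal sketch of max pooling with downsampling (ours; the width-3, stride-2 setting and the detector values follow Fig. 9.9):

    import numpy as np

    def max_pool1d(x, width=3, stride=2):
        # Max pooling over 1-D detector outputs, one summary statistic per stride.
        # The final pool may cover fewer than `width` units so that no detector
        # unit is ignored, as in Fig. 9.9.
        out = []
        for start in range(0, len(x), stride):
            out.append(np.max(x[start:start + width]))
        return np.array(out)

    detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
    print(max_pool1d(detector))   # [1.0, 0.2, 0.1]: three pooled outputs for six detector units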
For many tasks, pooling is essential for handling inputs of varying size. For
example, if we want to classify images of variable size, the input to the classifi-
cation layer must have a fixed size. This is usually accomplished by varying the
size of and offset between pooling regions so that the classification layer always
receives the same number of summary statistics regardless of the input size. For
example, the final pooling layer of the network may be defined to output four sets
of summary statistics, one for each quadrant of an image, regardless of the image
size.
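A sketch of such quadrant pooling (ours, not a specific library's API):

    import numpy as np

    def quadrant_max_pool(feature_map):
        # Pool a 2-D feature map into four summary statistics, one per quadrant.
        # The pooling regions scale with the input, so the classification layer
        # always receives exactly four numbers regardless of the image size.
        H, W = feature_map.shape
        h, w = H // 2, W // 2
        return np.array([feature_map[:h, :w].max(), feature_map[:h, w:].max(),
                         feature_map[h:, :w].max(), feature_map[h:, w:].max()])

    print(quadrant_max_pool(np.random.rand(96, 128)).shape)   # (4,)
    print(quadrant_max_pool(np.random.rand(50, 70)).shape)    # (4,) for a different input size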
9.4 Convolution and Pooling as an Infinitely Strong
Prior
Recall the concept of a prior probability distribution from section 5.3. This is a
probability distribution over the parameters of a model that encodes our beliefs
about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how concentrated the
probability density in the prior is. A weak prior is a prior distribution with high
entropy, such as a Gaussian distribution with high variance. Such a prior allows the
data to move the parameters more or less freely. A strong prior has very low
entropy, such as a Gaussian distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says
that these parameter values are completely forbidden, regardless of how much
support the data gives to those values.
We can imagine a convolutional net as being similar to a fully connected net,
but with an infinitely strong prior over its weights. This infinitely strong prior
says that the weights for one hidden unit must be identical to the weights of its
neighbor, but shifted in space. The prior also says that the weights must be zero,
except for in the small, spatially contiguous receptive field assigned to that hidden
unit. Overall, we can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer. This prior
says that the function the layer should learn contains only local interactions and
is equivariant to translation. Likewise, the use of pooling is an infinitely strong
prior that each unit should be invariant to small translations.
Of course, implementing a convolutional net as a fully connected net with an
infinitely strong prior would be extremely computationally wasteful. But thinking
of a convolutional net as a fully connected net with an infinitely strong prior can
give us some insights into how convolutional nets work.
This view of convolution and pooling as an infinitely strong prior gives a few
insights into how convolution and pooling work.
One key insight is that convolution and pooling can cause underfitting. Like
any prior, convolution and pooling are only useful when the assumptions made
by the prior are reasonably accurate. If a task relies on preserving precise
spatial information, then using pooling on all features can cause underfitting.
(Some convolutional network architectures (Szegedy et al., 2014) are designed to
use pooling on some channels but not on other channels, in order to get both
highly invariant features and features that will not underfit when the translation
invariance prior is incorrect.) When a task involves incorporating information from
very distant locations in the input, then the prior imposed by convolution may
be inappropriate.
Another key insight from this view is that we should only compare convolu-
tional models to other convolutional models in benchmarks of statistical learning
performance. Models that do not use convolution would be able to learn even
if we permuted all of the pixels in the image. For many image datasets, there
are separate benchmarks for models that are permutation invariant and must dis-
cover the concept of topology via learning, and models that have the knowledge
of spatial relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we usually do
not refer exactly to the standard discrete convolution operation as it is usually
understood in the mathematical literature. The functions used in practice differ
slightly. Here we describe these differences in detail, and highlight some useful
properties of the functions used in neural networks.
First, when we refer to convolution in the context of neural networks, we usu-
ally actually mean an operation that consists of many applications of convolution
in parallel. This is because convolution with a single kernel can only extract one
kind of feature, albeit at many spatial locations. Usually we want each layer of
our network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green,
and blue intensity at each pixel. In a multilayer convolutional network, the input
to the second layer is the output of the first layer, which usually has the output
of many different convolutions at each position. When working with images, we
usually think of the input and output of the convolution as being 3-D arrays, with
one index into the different channels and two indices into the spatial coordinates
of each channel. (Software implementations usually work in batch mode, so they
will actually use 4-D arrays, with the fourth axis indexing different examples in
the batch.)
Note that because convolutional networks usually use multi-channel convolu-
tion, the linear operations they are based on are not guaranteed to be commu-
tative, even if kernel-flipping is used. These multi-channel operations are only
commutative if each operation has the same number of output channels as input
channels.
Convolution is traditionally defined to index the kernel in the opposite or-
der as the input. This definition makes convolution commutative, which can be
convenient for writing proofs. In the context of neural networks, the commuta-
tive property of convolution is not especially helpful, so many implementations of
convolution index both the kernel and the input in the same order.
Assume we have a 4-D kernel array K with element K_{i,j,k,l} giving the connection
strength between a unit in channel i of the output and a unit in channel j of the
input, with an offset of k rows and l columns between the output unit and the
input unit. Assume our input consists of observed data V with element V_{i,j,k}
giving the value of the input unit within channel i at row j and column k. Assume
our output consists of Z with the same format as V. If Z is produced by convolving
K across V without flipping K, then

    Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n} \, K_{i,l,m,n},

where the summation over l, m, and n is over all values for which the array
indexing operations inside the summation are valid.
We may also want to skip over some positions of the kernel in order to reduce
the computational cost (at the expense of not extracting our features as finely).
We can think of this as downsampling the output of the full convolution function.
If we want to sample only every s pixels in each direction in the output, then we
can define a downsampled convolution function c such that

    Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j \times s + m,\, k \times s + n} \, K_{i,l,m,n} \right].    (9.1)

We refer to s as the stride of this downsampled convolution. It is also possible
to define a separate stride for each direction of motion.
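The following is a naive, loop-based sketch (ours) of c(K, V, s), written to mirror equation 9.1 with zero-based indexing; setting s = 1 recovers the unstrided operation:

    import numpy as np

    def strided_conv(K, V, s):
        # c(K, V, s): multi-channel "convolution" (no kernel flipping) with stride s.
        # K has shape (out_channels, in_channels, kernel_rows, kernel_cols).
        # V has shape (in_channels, rows, cols).
        # Returns Z of shape (out_channels, out_rows, out_cols).
        oc, ic, kr, kc = K.shape
        _, r, c = V.shape
        out_r = (r - kr) // s + 1
        out_c = (c - kc) // s + 1
        Z = np.zeros((oc, out_r, out_c))
        for i in range(oc):                     # output channel
            for j in range(out_r):              # output row
                for k in range(out_c):          # output column
                    patch = V[:, j * s:j * s + kr, k * s:k * s + kc]
                    Z[i, j, k] = np.sum(patch * K[i])
        return Z

    K = np.random.rand(2, 3, 3, 3)              # 2 output channels, 3 input channels, 3x3 kernels
    V = np.random.rand(3, 8, 8)
    print(strided_conv(K, V, s=2).shape)        # (2, 3, 3)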
One essential feature of any convolutional network implementation is the abil-
ity to implicitly zero-pad the input V in order to make it wider. Without this
feature, the width of the representation shrinks by the kernel width - 1 at each
layer. Zero padding the input allows us to control the kernel width and the size of
the output independently. Without zero padding, we are forced to choose between
shrinking the spatial extent of the network rapidly and using small kernels–both
scenarios that significantly limit the expressive power of the network. See Fig. 9.10
for an example.
Three special cases of the zero-padding setting are worth mentioning. One is
the extreme case in which no zero-padding is used whatsoever, and the convolution
kernel is only allowed to visit positions where the entire kernel is contained
within the image. In MATLAB terminology, this is called valid convolution. In
this case, all pixels in the output are a function of the same number of pixels in
the input, so the behavior of an output pixel is somewhat more regular. However,
the size of the output shrinks at each layer. If the input image is of size m × m and
the kernel is of size k × k, the output will be of size (m − k + 1) × (m − k + 1). The rate
of this shrinkage can be dramatic if the kernels used are large. Since the shrinkage
is greater than 0, it limits the number of convolutional layers that can be included
in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully
be considered convolutional. Another special case of the zero-padding setting
is when just enough zero-padding is added to keep the size of the output equal
to the size of the input. MATLAB calls this same convolution. In this case,
the network can contain as many convolutional layers as the available hardware
can support, since the operation of convolution does not modify the architectural
possibilities available to the next layer. However, the input pixels near the border
influence fewer output pixels than the input pixels near the center. This can
make the border pixels somewhat underrepresented in the model. This motivates
the other extreme case, which MATLAB refers to as full convolution, in which
enough zeroes are added for every pixel to be visited k times in each direction,
resulting in an output image of size (m + k − 1) × (m + k − 1). In this case, the output
pixels near the border are a function of fewer pixels than the output pixels near
Figure 9.10: The effect of zero padding on network size: Consider a convolutional network
with a kernel of width six at every layer. In this example, we do not use any pooling, so
only the convolution operation itself shrinks the network size. Top) In this convolutional
network, we do not use any implicit zero padding. This causes the representation to
shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only
able to have three convolutional layers, and the last layer does not ever move the kernel,
so arguably only two of the layers are truly convolutional. The rate of shrinking can
be mitigated by using smaller kernels, but smaller kernels are less expressive and some
shrinking is inevitable in this kind of architecture. Bottom) By adding five implicit zeroes
to each layer, we prevent the representation from shrinking with depth. This allows us
to make an arbitrarily deep convolutional network.
the center. This can make it difficult to learn a single kernel that performs well
at all positions in the convolutional feature map. Usually the optimal amount of
zero padding (in terms of test set classification accuracy) lies somewhere between
“valid” and “same” convolution.
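The three padding regimes can be sketched for the 1-D case (our own helper; "valid", "same", and "full" follow the MATLAB terminology used above, and the "same" rule shown assumes an odd kernel width):

    import numpy as np

    def padded_conv1d(x, w, mode="valid"):
        # 1-D convolution under the three zero-padding regimes described above.
        k = len(w)
        pad = {"valid": 0, "same": (k - 1) // 2, "full": k - 1}[mode]
        xp = np.pad(x, pad)                       # implicit zeros at both ends
        # valid convolution of the padded input with the flipped kernel
        return np.array([np.dot(xp[i:i + k], w[::-1]) for i in range(len(xp) - k + 1)])

    x = np.random.rand(16)
    w = np.random.rand(5)
    for mode in ("valid", "same", "full"):
        print(mode, padded_conv1d(x, w, mode).shape)   # (12,), (16,), (20,)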
In some cases, we do not actually want to use convolution, but rather locally
connected layers. In this case, the adjacency matrix in the graph of our MLP is
the same, but every connection has its own weight, specified by a 6-D array W.
The indices i, j, and k address the output channel, row, and column, as for Z,
while l, m, and n address the input channel and the row and column offsets within
the receptive field, as for K above:

    Z_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j+m,\, k+n} \, W_{i,j,k,l,m,n} \right].
Locally connected layers are useful when we know that each feature should
be a function of a small part of space, but there is no reason to think that the
same feature should occur across all of space. For example, if we want to tell if
an image is a picture of a face, we only need to look for the mouth in the bottom
half of the image.
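A sketch of such a layer (ours), with one independent weight for every connection:

    import numpy as np

    def locally_connected(W, V):
        # Unshared "convolution": every output location has its own weights.
        # W has shape (out_channels, out_rows, out_cols, in_channels, kernel_rows, kernel_cols).
        # V has shape (in_channels, rows, cols).
        oc, out_r, out_c, ic, kr, kc = W.shape
        Z = np.zeros((oc, out_r, out_c))
        for i in range(oc):
            for j in range(out_r):
                for k in range(out_c):
                    Z[i, j, k] = np.sum(V[:, j:j + kr, k:k + kc] * W[i, j, k])
        return Z

    V = np.random.rand(3, 8, 8)
    W = np.random.rand(4, 6, 6, 3, 3, 3)    # 6 = 8 - 3 + 1 output positions per axis
    print(locally_connected(W, V).shape)    # (4, 6, 6)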
It can also be useful to make versions of convolution or locally connected layers
in which the connectivity is further restricted, for example so that each
output channel i is a function of only a subset of the input channels l.
Tiled convolution offers a compromise between a convolutional layer and a
locally connected layer. We make K a 6-D tensor, where two of the dimensions
correspond to different locations in the output map. Rather than having a sep-
arate index for each location in the output map, output locations cycle through
a set of t different choices of kernel stack in each direction. If t is equal to the
output width, this is the same as a locally connected layer.
    Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n} \, K_{i,l,m,n,\, j\%t,\, k\%t},

where % is the modulo operation.
Note that it is straightforward to generalize this equation to use a different
tiling range for each dimension.
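A sketch of tiled convolution (ours) that cycles through t kernel stacks along each spatial axis:

    import numpy as np

    def tiled_conv(K, V, t):
        # Tiled "convolution": output locations cycle through t kernel choices per axis.
        # K has shape (out_channels, in_channels, kernel_rows, kernel_cols, t, t).
        # V has shape (in_channels, rows, cols).
        oc, ic, kr, kc, _, _ = K.shape
        _, r, c = V.shape
        Z = np.zeros((oc, r - kr + 1, c - kc + 1))
        for i in range(oc):
            for j in range(Z.shape[1]):
                for k in range(Z.shape[2]):
                    kernel = K[i, :, :, :, j % t, k % t]   # which of the t x t kernel stacks to use
                    Z[i, j, k] = np.sum(V[:, j:j + kr, k:k + kc] * kernel)
        return Z

    K = np.random.rand(2, 3, 3, 3, 2, 2)     # t = 2 along each direction
    V = np.random.rand(3, 8, 8)
    print(tiled_conv(K, V, t=2).shape)       # (2, 6, 6)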
Both locally connected layers and tiled convolutional layers have an interesting
interaction with max-pooling: the detector units of these layers are driven by
different filters. If these filters learn to detect different transformed versions of
the same underlying features, then the max-pooled units become invariant to the
learned transformation (see Fig. 9.8). Convolutional layers are hard-coded to be
invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a
convolutional network. To perform learning, one must be able to compute the
gradient with respect to the kernel, given the gradient with respect to the outputs.
In some simple cases, this operation can be performed using the convolution
operation, but many cases of interest, including the case of stride greater than 1,
do not have this property.
Convolution is a linear operation and can thus be described as a matrix mul-
tiplication (if we first reshape the input array into a flat vector). The matrix
involved is a function of the convolution kernel. The matrix is sparse and each
element of the kernel is copied to several elements of the matrix. It is not usually
practical to implement convolution in this way, but the matrix view is nonetheless
conceptually useful.
Multiplication by the transpose of the matrix defined by convolution is also a
useful operation. This is the operation needed to backpropagate error derivatives
through a convolutional layer, so it is needed to train convolutional networks
that have more than one hidden layer. This same operation is also needed to
compute the reconstruction in a convolutional autoencoder (or to perform the
analogous role in a convolutional RBM, sparse coding model, etc.). Like the
kernel gradient operation, this input gradient operation can be implemented using
a convolution in some cases, but in the general case requires a third operation to be
implemented. Care must be taken to coordinate this transpose operation with the
forward propagation. The size of the output that the transpose operation should
return depends on the zero padding policy and stride of the forward propagation
operation, as well as the size of the forward propagation’s output map. In some
cases, multiple sizes of input to forward propagation can result in the same size
of output map, so the transpose operation must be explicitly told what the size
of the original input was.
It turns out that these three operations–convolution, backprop from output
to weights, and backprop from output to inputs–are sufficient to compute all of
the gradients needed to train any depth of feedforward convolutional network,
as well as to train convolutional networks with reconstruction functions based
on the transpose of convolution. See (Goodfellow, 2010) for a full derivation
of the equations in the fully general multi-dimensional, multi-example case. To
give a sense of how these equations work, we present the two dimensional, single
example version here.
Suppose we want to train a convolutional network that incorporates strided
convolution of kernel stack K applied to multi-channel image V with stride s, as
defined by c(K, V, s) in equation 9.1. Suppose we want to minimize some loss
function J(V, K). During forward propagation, we will need to use c itself to
output Z, which is then propagated through the rest of the network and used
to compute J. During backpropagation, we will receive an array G such that

    G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K).
To train the network, we need to compute the derivatives with respect to the
weights in the kernel. To do so, we can use a function
    g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K)
                         = \sum_{m,n} G_{i,m,n} \, V_{j,\, m \times s + k,\, n \times s + l}.
If this layer is not the bottom layer of the network, we’ll need to compute the
gradient with respect to V in order to backpropagate the error farther down. To
do so, we can use a function
    h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)
                       = \sum_{\substack{l,m \\ \text{s.t.}\ s \times l + m = j}} \;
                         \sum_{\substack{n,p \\ \text{s.t.}\ s \times n + p = k}} \;
                         \sum_q K_{q,i,m,p} \, G_{q,l,n}.
We could also use h to define the reconstruction of a convolutional autoen-
coder, or the probability distribution over visible given hidden units in a convo-
lutional RBM or sparse coding model. Suppose we have hidden units H in the
same format as Z and we define a reconstruction
R = h(K, H, s).
In order to train the autoencoder, we will receive the gradient with respect
to R as an array E. To train the decoder, we need to obtain the gradient with
respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain
the gradient with respect to H. This is given by c(K, E, s). It is also possible to
differentiate through g using c and h, but these operations are not needed for the
backpropagation algorithm on any standard network architectures.
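Both gradient operations can be sketched directly from these formulas; the loop-based code below is our own naive illustration, not an efficient implementation:

    import numpy as np

    def kernel_grad(G, V, s, kernel_shape):
        # g(G, V, s): gradient of the loss with respect to the kernel K.
        oc, ic, kr, kc = kernel_shape
        dK = np.zeros(kernel_shape)
        _, out_r, out_c = G.shape
        for i in range(oc):
            for j in range(ic):
                for k in range(kr):
                    for l in range(kc):
                        for m in range(out_r):
                            for n in range(out_c):
                                dK[i, j, k, l] += G[i, m, n] * V[j, m * s + k, n * s + l]
        return dK

    def input_grad(K, G, s, input_shape):
        # h(K, G, s): gradient of the loss with respect to the input V (also the
        # transpose operation used for reconstruction in a convolutional autoencoder).
        oc, ic, kr, kc = K.shape
        dV = np.zeros(input_shape)
        _, out_r, out_c = G.shape
        for q in range(oc):
            for l in range(out_r):
                for n in range(out_c):
                    for m in range(kr):
                        for p in range(kc):
                            dV[:, s * l + m, s * n + p] += K[q, :, m, p] * G[q, l, n]
        return dV

    V = np.random.rand(3, 8, 8)
    K = np.random.rand(2, 3, 3, 3)
    G = np.random.rand(2, 3, 3)               # stand-in for the gradient from the layers above, s = 2
    print(kernel_grad(G, V, 2, K.shape).shape)   # (2, 3, 3, 3)
    print(input_grad(K, G, 2, V.shape).shape)    # (3, 8, 8)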
Generally, we do not use only a linear operation in order to transform from the
inputs to the outputs in a convolutional layer. We generally also add some bias
term to each output before applying the nonlinearity. This raises the question
of how to share parameters among the biases. For locally connected layers it is
natural to give each unit its own bias, and for tiled convolution, it is natural to
share the biases with the same tiling pattern as the kernels. For convolutional
layers, it is typical to have one bias per channel of the output and share it across
all locations within each convolution map. However, if the input is of known,
fixed size, it is also possible to learn a separate bias at each location of the output
map. Separating the biases may slightly reduce the statistical efficiency of the
model, but also allows the model to correct for differences in the image statistics
at different locations. For example, when using implicit zero padding, detector
units at the edge of the image receive less total input and may need larger biases.
9.6 Data Types
The data used with a convolutional network usually consists of several channels,
each channel being the observation of a different quantity at some point in space
or time. See Table 9.1 for examples of data types with different dimensionalities
and number of channels.
So far we have discussed only the case where every example in the train
and test data has the same spatial dimensions. One advantage to convolutional
networks is that they can also process inputs with varying spatial extents. These
kinds of input simply cannot be represented by traditional, matrix multiplication-
based neural networks. This provides a compelling reason to use convolutional
networks even when computational cost and overfitting are not significant issues.
For example, consider a collection of images, where each image has a different
width and height. It is unclear how to apply matrix multiplication. Convolution
is straightforward to apply; the kernel is simply applied a different number of
times depending on the size of the input, and the output of the convolution
operation scales accordingly. Sometimes the output of the network is allowed to
have variable size as well as the input, for example if we want to assign a class
label to each pixel of the input. In this case, no further design work is necessary.
In other cases, the network must produce some fixed-size output, for example if
we want to assign a single class label to the entire image. In this case we must
make some additional design steps, like inserting a pooling layer whose pooling
regions scale in size proportional to the size of the input, in order to maintain a
fixed number of pooled outputs.
Note that the use of convolution for processing variable sized inputs only makes
sense for inputs that have variable size because they contain varying amounts of
observation of the same kind of thing–different lengths of recordings over time,
different widths of observations over space, etc. Convolution does not make sense
if the input has variable size because it can optionally include different kinds
of observations. For example, if we are processing college applications, and our
features consist of both grades and standardized test scores, but not every ap-
plicant took the standardized test, then it does not make sense to convolve the
same weights over both the features corresponding to the grades and the features
corresponding to the test scores.
9.7 Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing
more than one million units. Powerful implementations exploiting parallel compu-
tation resources, as discussed in Chapter 11, are essential. However, in many cases
it is also possible to speed up convolution by selecting an appropriate convolution
algorithm.
Convolution is equivalent to converting both the input and the kernel to the
frequency domain using a Fourier transform, performing point-wise multiplication
Table 9.1: Examples of different formats of data that can be used with convolutional
networks.

1-D, single channel. Audio waveform: The axis we convolve over corresponds to
time. We discretize time and measure the amplitude of the waveform once per
time step.

1-D, multi-channel. Skeleton animation data: Animations of 3-D computer-rendered
characters are generated by altering the pose of a “skeleton” over time. At each
point in time, the pose of the character is described by a specification of the angles
of each of the joints in the character’s skeleton. Each channel in the data we feed
to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier
transform: We can transform the audio waveform into a 2-D array with different
rows corresponding to different frequencies and different columns corresponding
to different points in time. Using convolution over the time axis makes the model
equivariant to shifts in time. Using convolution across the frequency axis makes
the model equivariant to frequency, so that the same melody played in a different
octave produces the same representation, but at a different height in the network’s
output.

2-D, multi-channel. Color image data: One channel contains the red pixels, one
the green pixels, and one the blue pixels. The convolution kernel moves over both
the horizontal and vertical axes of the image, conferring translation equivariance
in both directions.

3-D, single channel. Volumetric data: A common source of this kind of data is
medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: One axis corresponds to time, one to the
height of the video frame, and one to the width of the video frame.
of the two signals, and converting back to the time domain using an inverse
Fourier transform. For some problem sizes, this can be faster than the naive
implementation of discrete convolution.
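A sketch of this frequency-domain route for the 1-D case (ours; practical implementations handle padding, batching, and multiple channels more carefully):

    import numpy as np

    def fft_conv1d(x, w):
        # Convolve x with w by pointwise multiplication in the frequency domain.
        n = len(x) + len(w) - 1                  # length of the full linear convolution
        X = np.fft.rfft(x, n)                    # zero-pad to length n to avoid circular wrap-around
        W = np.fft.rfft(w, n)
        return np.fft.irfft(X * W, n)

    x = np.random.rand(64)
    w = np.random.rand(5)
    print(np.allclose(fft_conv1d(x, w), np.convolve(x, w)))   # True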
When a d-dimensional kernel can be expressed as the outer product of d
vectors, one vector per dimension, the kernel is called separable. When the kernel
is separable, naive convolution is inefficient. It is equivalent to composing d one-
dimensional convolutions with each of these vectors. The composed approach
is significantly faster than performing one d-dimensional convolution with their
outer product. The kernel also takes fewer parameters to represent as vectors.
If the kernel is w elements wide in each dimension, then naive multidimensional
convolution requires O(w^d) runtime and parameter storage space, while separable
convolution requires O(w × d) runtime and parameter storage space. Of course,
not every convolution can be represented in this way.
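Separability can be checked numerically for the 2-D case; the sketch below (ours) compares one full 2-D convolution against the composition of two 1-D convolutions:

    import numpy as np

    # A separable 2-D kernel: the outer product of two 1-D vectors.
    row = np.array([1.0, 2.0, 1.0])
    col = np.array([1.0, 0.0, -1.0])
    K = np.outer(col, row)                       # 3 x 3 kernel stored with only 2w = 6 underlying numbers

    I = np.random.rand(32, 32)

    # Route 1: one full 2-D convolution (O(w^2) work per output pixel).
    full = np.array([[np.sum(I[i:i + 3, j:j + 3] * K[::-1, ::-1])
                      for j in range(I.shape[1] - 2)]
                     for i in range(I.shape[0] - 2)])

    # Route 2: compose two 1-D convolutions, one per dimension (O(2w) work per output pixel).
    tmp = np.apply_along_axis(lambda r: np.convolve(r, row, mode="valid"), 1, I)
    sep = np.apply_along_axis(lambda c: np.convolve(c, col, mode="valid"), 0, tmp)
    print(np.allclose(full, sep))                # True: same result with far fewer operations for large w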
Devising faster ways of performing convolution or approximate convolution
without harming the accuracy of the model is an active area of research. Even
techniques that improve the efficiency of only forward propagation are useful
because in the commercial setting, it is typical to devote many more resources to
deployment of a network than to its training.
9.8 The Neuroscientific Basis for Convolutional Net-
works
Convolutional networks are perhaps the greatest success story of biologically in-
spired artificial intelligence. Though convolutional networks have been guided
by many other fields, some of the key design principles of neural networks were
drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments
long before the relevant computational models were developed. Neurophysiol-
ogists David Hubel and Torsten Wiesel collaborated for several years to deter-
mine many of the most basic facts about how the mammalian vision system
works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were even-
tually recognized with a Nobel Prize. Their findings that have had the greatest
influence on contemporary deep learning models were based on recording the
activity of individual neurons in cats. By anesthetizing the cat, they could im-
mobilize the cat’s eye and observe how neurons in the cat’s brain responded to
images projected in precise locations on a screen in front of the cat.
Their work helped to characterize many aspects of brain function that are
beyond the scope of this book. From the point of view of deep learning, we can
focus on a simplified, cartoon view of brain function.
In this simplified view, we focus on a part of the brain called V1, also known
as the primary visual cortex. V1 is the first area of the brain that begins to
perform significantly advanced processing of visual input. In this cartoon view,
images are formed by light arriving in the eye and stimulating the retina, the
light-sensitive tissue in the back of the eye. The neurons in the retina perform
some simple preprocessing of the image but do not substantially alter the way it
is represented. The image then passes through the optic nerve and a brain region
called the lateral geniculate nucleus. As far as we are concerned here, the main
role of both of these anatomical regions is simply to carry the signal from
the eye to V1, which is located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional struc-
ture mirroring the structure of the image in the retina. For example, light
arriving at the lower half of the retina affects only the corresponding half of
V1. Convolutional networks capture this property by having their features
defined in terms of two dimensional maps.
2. V1 contains many simple cells. A simple cell’s activity can to some extent be
characterized by a linear function of the image in a small, spatially localized
receptive field. The detector units of a convolutional network are designed
to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that
are similar to those detected by simple cells, but complex cells are invariant
to small shifts in the position of the feature. This inspires the pooling units
of convolutional networks. Complex cells are also invariant to some changes
in lighting that cannot be captured simply by pooling over spatial locations.
These invariances have inspired some of the cross-channel pooling strategies
in convolutional networks, such as maxout units (Goodfellow et al., 2013a).
Though we know the most about V1, it is generally believed that the same
basic principles apply to other brain regions. In our cartoon view of the visual
system, the basic strategy of detection followed by pooling is repeatedly applied
as we move deeper into the brain. As we pass through multiple anatomical layers
of the brain, we eventually find cells that respond to some specific concept and are
invariant to many transformations of the input. These cells have been nicknamed
“grandmother cells”— the idea is that a person could have a neuron that activates
when seeing an image of their grandmother, regardless of whether she appears in
the left or right side of the image, whether the image is a close-up of her face or
a zoomed-out shot of her entire body, whether she is brightly lit or in shadow, etc.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals,
and found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when
a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text
containing the words “Halle Berry.” Of course, this has nothing to do with Halle
Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer
Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the
LGN, to V1, then onward to V2, then V4, then IT. This happens within the first
100ms of glimpsing an object. If a person is allowed to continue looking at the
object for more time, then information will begin to flow backwards as the brain
uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person’s gaze, and observe only the firing rates that
result from the first 100ms of mostly feed-forward activation, then IT proves to be
very similar to a convolutional network. Convolutional networks can predict IT
firing rates, and also perform very similarly to (time limited) humans on object
recognition tasks (DiCarlo, 2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists, but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called
the fovea. The fovea only observes an area about the size of a thumbnail
held at arm's length. Though we feel as if we can see an entire scene in high
resolution, this is an illusion created by the subconscious part of our brain,
as it stitches together several glimpses of small areas. Most convolutional
networks actually receive large full-resolution photographs as input.
• The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional networks
so far are purely visual.
• The human visual system does much more than just recognize objects. It is
able to understand entire scenes including many objects and relationships
between objects, and processes rich 3-D geometric information needed for
our bodies to interface with the world. Convolutional networks have been
applied to some of these problems but these applications are in their infancy.
• Even simple brain areas like V1 are heavily impacted by feedback from
higher levels. Feedback has been explored extensively in neural network
models but has not yet been shown to offer a compelling improvement.
• While feed-forward IT firing rates capture much of the same information as
convolutional network features, it’s not clear how similar the intermediate
computations are. The brain probably uses very different activation and
pooling functions. An individual neuron’s activation probably is not well-
characterized by a single linear filter response. A recent model of V1 involves
multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our
cartoon picture of “simple cells” and “complex cells” might create a non-
existent distinction; simple cells and complex cells might both be the same
kind of cell but with their “parameters” enabling a continuum of behaviors
ranging from what we call “simple” to what we call “complex.”
It’s also worth mentioning that neuroscience has told us relatively little about
how to train convolutional networks. Model structures inspired by the work of
Hubel and Wiesel date back to the Neocognitron (Fukushima, 1980) but the
Neocognitron relied on a relatively heuristic learning algorithm. Convolutional
networks did not begin to work well until they were combined with the backprop-
agation algorithm (LeCun et al., 1989), which was not inspired by any neurosci-
entific observation and is considered by some to be biologically implausible.
Some of the most striking correspondences between neuroscience and machine
learning come from visually comparing the features learned by machine learning
models with those employed by V1.
9.9 Convolutional Networks and the History of Deep
Learning
Convolutional networks have played an important role in the history of deep
learning. They are a key example of a successful application of insights obtained
by studying the brain to machine learning. They were also some of
the first deep models to perform well, long before arbitrary deep models were
considered viable. Convolutional networks were also some of the first neural
networks to solve important commercial applications and remain at the forefront
of commercial applications of deep learning today.
It is not entirely clear why convolutional networks succeeded when general
backpropagation networks were considered to have failed. It may simply be that
convolutional networks were more computationally efficient than fully connected
networks, so it was easier to run multiple experiments with them and tune their
implementation and hyperparameters. Larger networks also seem to be easier
to train. With modern hardware, fully connected networks appear to perform
reasonably on many tasks, even when using datasets that were available and
activation functions that were popular during the times when fully connected
networks were believed not to work well. It may be that the primary barriers
to the success of neural networks were psychological. Whatever the case, it is
fortunate that convolutional networks performed well and paved the way to the
acceptance of neural networks in general.