convolution, if we let g be any function that translates the input, i.e., shifts it, then
the convolution function is equivariant to g. For example, define g(x) such that for
all i, g(x)[i] = x[i − 1]. This shifts every element of x one unit to the right. If we
apply this transformation to x, then apply convolution, the result will be the same as
if we applied convolution to x, then applied the transformation to the output. When
processing time series data, this means that convolution produces a sort of timeline that
shows when different features appear in the input. If we move an event later in time
in the input, the exact same representation of it will appear in the output, just later in
time. Similarly with images, convolution creates a 2-D map of where certain features
appear in the input. If we move the object in the input, its representation will move the
same amount in the output. This is useful when we know that the same local function
is useful everywhere in the input. For example, when processing images, it is useful to
detect edges in the first layer of a convolutional network, and an edge looks the same
regardless of where it appears in the image. This property is not always useful. For
example, if we want to recognize a face, some portion of the network needs to vary with
spatial location, because the top of a face does not look the same as the bottom of a
face–the part of the network processing the top of the face needs to look for eyebrows,
while the part of the network processing the bottom of the face needs to look for a chin.
Note that convolution is not equivariant to some other transformations, such as
changes in the scale or rotation of an image. Other mechanisms are necessary for
handling these kinds of transformations.
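To see this property concretely, the short NumPy sketch below (the signal and kernel values are arbitrary, chosen only for illustration and not taken from the text) checks that convolving a shifted input equals shifting the convolved output. A circular (wrap-around) shift and circular convolution are used so that the equality is exact at the edges; with ordinary zero-padded convolution it holds up to boundary effects.

import numpy as np

def circular_conv1d(x, k):
    # Circular (wrap-around) 1-D convolution, so that shifts commute exactly.
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        for j, kj in enumerate(k):
            out[i] += kj * x[(i - j) % n]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # illustrative 1-D signal
k = np.array([1.0, -2.0, 1.0])    # illustrative kernel

shift = lambda v: np.roll(v, 1)   # g(x)[i] = x[i - 1], wrapping at the edge

# Equivariance: convolving the shifted input equals shifting the convolved output.
assert np.allclose(circular_conv1d(shift(x), k),
                   shift(circular_conv1d(x, k)))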
We can think of the use of convolution as introducing an infinitely strong prior
probability distribution over the parameters of a layer. This prior says that the function
the layer should learn contains only local interactions and is equivariant to translation.
This view of convolution as an infinitely strong prior makes it clear that the efficiency
improvements of convolution come with a caveat: convolution is only applicable when
the assumptions made by this prior are approximately correct. The use of convolution constrains
the class of functions that the layer can represent. If the function that a layer needs to
learn indeed involves only local interactions and is equivariant to translation, then the layer will be dramatically
more efficient if it uses convolution rather than matrix multiplication. If the necessary
function does not have these properties, then using a convolutional layer will cause the
model to have high training error.
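As a rough illustration of this constraint (an illustrative sketch with made-up sizes, not a construction from the text), the NumPy code below builds the dense matrix that a small "valid" 1-D convolution implicitly multiplies by: every row contains the same few kernel weights, shifted over by one position, and zeros elsewhere. The convolution has only as many free parameters as the kernel, whereas an unconstrained layer has one parameter per matrix entry.

import numpy as np

def conv_as_matrix(kernel, n_in):
    # Dense matrix equivalent of a 'valid' 1-D convolution: each row holds the
    # same kernel weights (flipped, since convolution reverses the kernel),
    # shifted by one position, with zeros everywhere else.
    k = len(kernel)
    n_out = n_in - k + 1
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        W[i, i:i + k] = kernel[::-1]
    return W

n_in = 10
kernel = np.array([1.0, -2.0, 1.0])                 # 3 free parameters
x = np.random.default_rng(1).normal(size=n_in)

W = conv_as_matrix(kernel, n_in)
assert np.allclose(W @ x, np.convolve(x, kernel, mode='valid'))

print("free parameters, convolution:", kernel.size)  # 3
print("entries in the dense matrix: ", W.size)       # 80

Sharing the same kernel weights across every row and forcing everything outside the local band to zero is exactly the infinitely strong prior described above.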
Finally, some kinds of data cannot be processed by neural networks defined by matrix
multiplication with a fixed-shape matrix. Convolution enables processing of some of
these kinds of data. We discuss this further in section 9.5.
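As a quick illustration of this point (again a sketch with arbitrary values, not from the text), the snippet below applies the same three-parameter kernel to inputs of several different lengths; a layer defined by a fixed-shape weight matrix would accept only one of these input sizes.

import numpy as np

kernel = np.array([0.25, 0.5, 0.25])   # illustrative smoothing kernel

# The same three convolution parameters apply to an input of any length,
# whereas a fixed-shape weight matrix accepts inputs of only one size.
for n in (5, 12, 30):
    x = np.random.default_rng(n).normal(size=n)
    y = np.convolve(x, kernel, mode='valid')
    print(f"input length {n:2d} -> output length {len(y)}")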
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.5). In
the first stage, the layer performs several convolutions in parallel to produce a set of
linear activations. In the second stage, each linear activation is run through a
nonlinear activation function, such as the rectified linear activation function. This stage
is sometimes called the detector stage. In the third stage, we use a pooling function to