
CHAPTER 9. CONVOLUTIONAL NETWORKS
1. V1 is arranged in a spatial map. It actually has a two-dimensional struc-
ture mirroring the structure of the image in the retina. For example, light
arriving at the lower half of the retina affects only the corresponding half of
V1. Convolutional networks capture this property by having their features
defined in terms of two-dimensional maps.
2. V1 contains many simple cells. A simple cell’s activity can to some extent be
characterized by a linear function of the image in a small, spatially localized
receptive field. The detector units of a convolutional network are designed
to emulate these properties of simple cells. V1 also contains many complex
cells. These cells respond to features that are similar to those detected by
simple cells, but complex cells are invariant to small shifts in the position
of the feature. This inspires the pooling units of convolutional networks.
Complex cells are also invariant to some changes in lighting that cannot
be captured simply by pooling over spatial locations. These invariances
have inspired some of the cross-channel pooling strategies in convolutional
networks, such as maxout units (Goodfellow et al., 2013a).
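The three stages described above can be sketched in code. The following is a minimal NumPy illustration, not an implementation from this book: `detect` plays the role of the simple cells (a linear filter applied at every location of a small receptive field), `spatial_max_pool` plays the role of the complex cells (invariance to small shifts), and `maxout` performs cross-channel pooling over groups of channels in the spirit of maxout units. All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def detect(image, kernels):
    # Simple-cell stage: each kernel is a small linear filter applied at
    # every spatially localized receptive field ("valid" cross-correlation).
    kh, kw = kernels.shape[1:]
    H = image.shape[0] - kh + 1
    W = image.shape[1] - kw + 1
    out = np.empty((len(kernels), H, W))
    for c, k in enumerate(kernels):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
    return out

def spatial_max_pool(features, size=2):
    # Complex-cell stage: taking the max over small spatial neighborhoods
    # makes the response invariant to small shifts of the detected feature.
    C, H, W = features.shape
    out = features[:, :H - H % size, :W - W % size]
    out = out.reshape(C, H // size, size, W // size, size)
    return out.max(axis=(2, 4))

def maxout(features, group_size=2):
    # Cross-channel pooling: the max over groups of channels (as in maxout
    # units) can yield invariances that spatial pooling alone cannot.
    C, H, W = features.shape
    return features.reshape(C // group_size, group_size, H, W).max(axis=1)
```

For example, applying `detect` with two 3x3 kernels to a 6x6 image yields a feature map of shape `(2, 4, 4)`; `spatial_max_pool` reduces it to `(2, 2, 2)`, and `maxout` with a group size of 2 pools the two channels into one, giving shape `(1, 2, 2)`.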
Though we know the most about V1, it is generally believed that the same
basic principles apply to other brain regions. In our cartoon view of the visual
system, the basic strategy of detection followed by pooling is repeatedly applied
as we move deeper into the brain. As we pass through multiple anatomical layers
of the brain, we eventually find cells that respond to some specific concept and are
invariant to many transformations of the input. These cells have been nicknamed
“grandmother cells”— the idea is that a person could have a neuron that activates
when seeing an image of their grandmother, regardless of whether she appears in
the left or right side of the image, whether the image is a close-up of her face or
a zoomed-out shot of her entire body, whether she is brightly lit or in shadow, etc.
These grandmother cells have been shown to actually exist in the human brain,
in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers
tested whether individual neurons would respond to photos of famous individuals,
and found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when
a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text
containing the words “Halle Berry.” Of course, this has nothing to do with Halle
Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer
Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the