
CHAPTER 9. CONVOLUTIONAL NETWORKS
and found what has come to be called the “Halle Berry neuron”: an individual
neuron that is activated by the concept of Halle Berry. This neuron fires when
a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text
containing the words “Halle Berry.” Of course, this has nothing to do with Halle
Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer
Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern
convolutional networks, which would not automatically generalize to identifying
a person or object when reading its name. The closest analog to a convolutional
network’s last layer of features is a brain area called the inferotemporal cortex
(IT). When viewing an object, information flows from the retina, through the
LGN, to V1, then onward to V2, then V4, then IT. This happens within the first
100ms of glimpsing an object. If a person is allowed to continue looking at the
object for more time, then information will begin to flow backwards as the brain
uses top-down feedback to update the activations in the lower-level brain areas.
However, if we interrupt the person’s gaze, and observe only the firing rates that
result from the first 100ms of mostly feed-forward activation, then IT proves to be
very similar to a convolutional network. Convolutional networks can predict IT
firing rates, and also perform very similarly to (time-limited) humans on object
recognition tasks (DiCarlo, 2013).
That being said, there are many differences between convolutional networks
and the mammalian vision system. Some of these differences are well known
to computational neuroscientists, but outside the scope of this book. Some of
these differences are not yet known, because many basic questions about how the
mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called
the fovea. The fovea only observes an area about the size of a thumbnail
held at arm's length. Though we feel as if we can see an entire scene in high
resolution, this is an illusion created by the subconscious part of our brain,
as it stitches together several glimpses of small areas. Most convolutional
networks actually receive large full-resolution photographs as input.
• The human visual system is integrated with many other senses, such as
hearing, and factors like our moods and thoughts. Convolutional networks
so far are purely visual.
• The human visual system does much more than just recognize objects. It is
able to understand entire scenes including many objects and relationships
between objects, and processes rich 3-D geometric information needed for
our bodies to interface with the world. Convolutional networks have been
applied to some of these problems, but these applications are in their infancy.