
CHAPTER 1. INTRODUCTION
1.2.4 Increasing Accuracy, Application Complexity and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide
accurate recognition or prediction. Moreover, deep learning has consistently been
applied with success to broader and broader sets of applications.
The earliest deep models were used to recognize individual objects in tightly
cropped, extremely small images (Rumelhart et al., 1986a). Since then there
has been a gradual increase in the size of images neural networks could process.
Modern object recognition networks process rich high-resolution photographs and
do not require that the photo be cropped near the object to be recognized
(Krizhevsky et al., 2012b). Similarly, the earliest networks could only
recognize two kinds of objects (or in some cases, the absence or presence of a sin-
gle kind of object), while these modern networks typically recognize at least 1,000
different categories of objects. The largest contest in object recognition is the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC), held each year. A dramatic
moment in the meteoric rise of deep learning came when a convolutional network
won this challenge for the first time and by a wide margin, bringing down the
state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012b). Since
then, these competitions have consistently been won by deep convolutional nets, and as
of this writing, advances in deep learning have brought the latest error rate in this
contest down to 6.5%, as shown in Fig. 1.11, using even deeper networks (Szegedy
et al., 2014a). Outside the framework of the contest, this error rate has now
dropped to below 5% (Ioffe and Szegedy, 2015; Wu et al., 2015).
Deep learning has also had a dramatic impact on speech recognition. After
improving throughout the 1990s, the error rates for speech recognition stagnated
starting in about 2000. The introduction of deep learning (Dahl et al., 2010;
Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition
resulted in a sudden drop of error rates by up to half! We will explore this history
in more detail in Section 12.3.1.
Deep networks have also had spectacular successes for pedestrian detection
and image segmentation (Sermanet et al., 2013; Farabet et al., 2013a; Cou-
prie et al., 2013) and yielded superhuman performance in traffic sign classifica-
tion (Ciresan et al., 2012).
At the same time that the scale and accuracy of deep networks has increased,
so has the complexity of the tasks that they can solve. Goodfellow et al. (2014d)
showed that neural networks could learn to output an entire sequence of characters
transcribed from an image, rather than just identifying a single object. Previously,
it was widely believed that this kind of learning required labeling of the individual
elements of the sequence (Gülçehre and Bengio, 2013). Since this time, a neural
network designed to model sequences, the Long Short-Term Memory or LSTM