
very generic neural network techniques can be successfully applied to natural
language processing. However, to achieve excellent performance and scale well
to large applications, some domain-specific strategies become important. Natural
language modeling usually forces us to use some of the many techniques that are
specialized for processing sequential data. In many cases, we choose to regard
natural language as a sequence of words, rather than a sequence of individual
characters. In this case, because the total number of possible words is so large,
we are modeling an extremely high-dimensional and sparse discrete space. Several
strategies have been developed to make models of such a space efficient, both in
a computational and in a statistical sense.
12.4.1 Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart
et al. (1986a) in one of the first explorations of back-propagation, with symbols
corresponding to the identity of family members, with the neural network capturing
the relationships between them, e.g., with examples of the form
(Colin, Mother, Victoria). It turned out that the first layer of the neural network
learned a representation of each family member, with the learned features
representing, e.g., for Colin, which family tree Colin was in, what branch of that
tree he was in, which generation he belonged to, etc. One can think of these learned
features as a set of attributes, and of the rest of the neural network as computing
micro-rules that relate these attributes to each other in order to obtain the desired predictions, e.g.,
who is the mother of Colin? A similar idea was the basis of the research on
neural language models started by Bengio et al. (2001b), where this time each
symbol represented a word in a natural language vocabulary, and the task was to
predict the next word given a few previous ones. Instead of having a small set
of symbols, we have a vocabulary with tens or hundreds of thousands of words
(and nowadays up to a million, when considering proper names and
misspellings). This raises serious computational challenges, discussed below in
Section 12.4.4. The basic idea of neural language models and their extensions,
e.g., for machine translation, is illustrated in Figure 12.4 and a specific instance
(which was used by Bengio et al. (2001b)) is illustrated in Figure 12.4. Figure 12.4
explains the basic idea of splitting the model into two parts, one for the word
embeddings (mapping symbols to vectors) and one for the task to be performed.
Sometimes, different maps can be used for input words and for output words, e.g.,
as in Figure 12.4, or in neural machine translation models (Section 12.4.6).
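To make this two-part structure concrete, here is a minimal sketch of a feedforward neural language model in PyTorch. It is an illustration under assumed names and dimensions (TinyNeuralLM, vocab_size, embed_dim, etc.), not the exact architecture of Bengio et al. (2001b): an embedding table maps word symbols to vectors, and a small task network maps the concatenated embeddings of the previous words to scores over the next word.

import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    # Hypothetical illustration: sizes are placeholders, not values from the literature.
    def __init__(self, vocab_size=10000, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        # Part 1: word embeddings, mapping each symbol (word index) to a vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Part 2: the task network, here predicting the next word from the context.
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                # context_ids: (batch, context_size)
        e = self.embed(context_ids)                # (batch, context_size, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))  # concatenate embeddings, hidden layer
        return self.out(h)                         # unnormalized scores for each next word

# Usage: scores for the next word given three previous word indices.
model = TinyNeuralLM()
logits = model(torch.tensor([[12, 7, 432]]))       # shape (1, vocab_size)

A separate output map, as mentioned above for input versus output words, is implicit here in the weight matrix of the final linear layer; it could be kept distinct from, or tied to, the input embedding table.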
Earlier work had looked at modeling sequences of characters in text using neural
networks (Miikkulainen and Dyer, 1991; Schmidhuber, 1996), but it turned out
that working with word symbols gave better language models and, more
importantly, immediately yielded word embeddings, i.e., interpretable representa-