
gradient decomposition teaches us as well is the division of the total gradient into (1) a term due to the numerator (the $e^{a_y}$) and dependent on the actually observed target $y$, and (2) a term independent of $y$ which corresponds to the gradient of the softmax denominator. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov Random Fields, Boltzmann machines and RBMs, in Chapter 13.
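
To make the decomposition concrete, it can be written out for a single example, in the same notation as above (a sketch with $p = \mathrm{softmax}(a)$ and observed target $y$):
\[
-\log p_y = -a_y + \log \sum_k e^{a_k},
\qquad
\frac{\partial}{\partial a_j}\left(-\log p_y\right)
= -1_{y=j} + \frac{e^{a_j}}{\sum_k e^{a_k}}
= p_j - 1_{y=j}.
\]
The $-1_{y=j}$ term arises from the numerator $e^{a_y}$ and depends on the observed target, while the $p_j$ term arises from the log of the softmax denominator (the partition function) and does not.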
The softmax has other interesting properties. First of all, the gradient of $\log p(y = i \mid x)$ with respect to $a$ only saturates in the case when $p(y = i \mid x)$ is already nearly maximal, i.e., approaching 1. Specifically, let us consider the case where the correct label is $i$, i.e., $y = i$. The element of the gradient associated with an erroneous label, say $j \neq i$, is
\[
\frac{\partial}{\partial a_j} L_{NLL}(p, y) = p_j. \tag{6.6}
\]
So if the model correctly predicts a low probability that $y = j$, i.e., that $p_j \approx 0$, then the gradient is also close to zero. But if the model incorrectly and confidently predicts that $j$ is the correct class, i.e., $p_j \approx 1$, there will be a strong push to reduce $a_j$. Conversely, if the model incorrectly and confidently predicts that the correct class $y$ should have a low probability, i.e., $p_y \approx 0$, there will be a strong push (a gradient of about $-1$) to push $a_y$ up. One way to see this is to imagine doing gradient descent on the $a_j$'s themselves (that is what backprop is really based on): the update on $a_j$ would be proportional to minus one times the gradient on $a_j$, so a positive gradient on $a_j$ (e.g., incorrectly confident that $p_j \approx 1$) pushes $a_j$ down, while a negative gradient on $a_y$ (e.g., incorrectly confident that $p_y \approx 0$) pushes $a_y$ up. In fact, note how $a_y$ is always pushed up, because $p_y - 1_{y=y} = p_y - 1 < 0$, and the other scores $a_j$ (for $j \neq y$) are always pushed down, because their gradient is $p_j > 0$.
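
As a small numerical illustration of this behaviour, the following NumPy sketch (with made-up scores $a$ for a three-class problem in which the model is confidently wrong) evaluates the gradient $p - e_y$ of the NLL with respect to the scores:

import numpy as np

def softmax(a):
    # numerically stable softmax
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([1.0, 4.0, -1.0])  # made-up scores: the model confidently favors class 1
y = 0                           # but the correct class is 0

p = softmax(a)
one_hot = np.zeros_like(p)
one_hot[y] = 1.0

grad_a = p - one_hot  # gradient of the NLL with respect to the scores a
print(p)              # p[1] is close to 1, p[0] is close to 0
print(grad_a)         # gradient on a_1 is close to +1 (pushes a_1 down),
                      # gradient on a_0 is close to -1 (pushes a_0 up)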
There are other loss functions, such as the squared error applied to softmax (or sigmoid) outputs (which was popular in the 80's and 90's), which have a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong (Solla et al., 1988). This may be a problem because it means that the parameters will basically not change, even though the output is wrong.
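
To see the contrast concretely, consider a single sigmoid output unit saturated in the wrong direction (output near 0 while the target is 1). A minimal sketch, with a made-up pre-activation value, compares the gradient of the squared error with that of the cross-entropy loss with respect to the pre-activation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, t = -10.0, 1.0  # made-up pre-activation: unit saturated near 0, but the target is 1
s = sigmoid(a)

# d/da of (sigmoid(a) - t)^2 includes the factor sigmoid'(a), which is ~0 here
grad_squared_error = 2 * (s - t) * s * (1 - s)

# d/da of the cross-entropy loss -t*log(s) - (1-t)*log(1-s) is simply s - t
grad_cross_entropy = s - t

print(s)                   # ~4.5e-05
print(grad_squared_error)  # ~-9.1e-05: vanishing, almost no parameter update
print(grad_cross_entropy)  # ~-1.0: a strong push to increase a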
To see how the squared error interacts with the softmax output, we need to introduce a one-hot encoding of the label, $y = e_i = [0, \ldots, 0, 1, 0, \ldots, 0]$, i.e., for the label $y = i$, we have $y_i = 1$ and $y_j = 0, \forall j \neq i$. We will again consider the output of the network to be $p = \mathrm{softmax}(a)$, where, as before, $a$ is the input to the softmax function (e.g., $a = b + Wh$ with $h$ the output of the last hidden layer).
For the squared error loss $L_2(p(a), y) = \|p(a) - y\|^2$, the gradient of the loss