Chapter 4
Numerical Computation
Machine learning algorithms usually require a large amount of numerical computation.
This refers to algorithms that solve mathematical problems by methods that iteratively
update estimates of the solution, rather than analytically deriving a symbolic expression
for the correct solution. Common operations include solving systems of linear equations
and finding the value of an argument that minimizes a function. Even just evaluating a
mathematical function on a digital computer can be difficult when the function involves
real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and underflow
The fundamental difficulty in performing continuous math on a digital computer is that
we need to represent infinitely many real numbers with a finite number of bit patterns.
This means that for almost all real numbers, we incur some approximation error when
we represent the number in the computer. In many cases, this is just rounding error.
Rounding error is problematic, especially when it compounds across many operations,
and can cause algorithms that work in theory to fail in practice if they are not designed
to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow. Underflow
occurs when numbers near zero are rounded to zero. If the result is later used as the
denominator of a fraction, the computer encounters a divide-by-zero error rather than
performing a valid floating point operation.
Another highly damaging form of numerical error is overflow. Overflow occurs when
numbers with large magnitude are approximated as ∞ or −∞. No more valid arithmetic
can be done at this point.
For an example of the need to design software implementations to deal with overflow
and underflow, consider the softmax function:
\[
\mathrm{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.
\]
Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may not occur when c has large magnitude. If c is very negative, then exp(c) will underflow. This means
the denominator of the softmax will become 0, so the final result is undefined. When
c is very large and positive, exp(c) will overflow, again resulting in the expression as a
whole being undefined. Both of these difficulties can be resolved by instead evaluating
softmax(z) where z = x − max_i x_i. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
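As a concrete illustration, here is a minimal NumPy sketch of the stabilized softmax described above (the function name stable_softmax and the example input are our own choices, not from the text):

```python
import numpy as np

def stable_softmax(x):
    # Subtract the maximum entry so that the largest argument to exp is 0,
    # which rules out overflow. At least one denominator term is then
    # exp(0) = 1, so underflow cannot cause a division by zero.
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

# With a large-magnitude constant input, a naive implementation overflows
# or underflows, while the stabilized version returns the correct 1/n.
x = np.array([1e4, 1e4, 1e4])
print(stable_softmax(x))  # [0.333... 0.333... 0.333...]
```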
There is still one small problem. Underflow in the numerator can still cause the ex-
pression as a whole to evaluate to zero. This means that if we implement log softmax(x)
by first running the softmax subroutine then passing the result to the log function, we
could erroneously obtain −∞. Instead, we must implement a separate function that
calculates log softmax in a numerically stable way. The log softmax function can be
stabilized using the same trick as we used to stabilize the softmax function.
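A corresponding sketch of a numerically stable log softmax, applying the same max-subtraction trick directly inside the logarithm (again a hypothetical helper of our own, not a routine from the text):

```python
import numpy as np

def stable_log_softmax(x):
    # log softmax(x)_i = z_i - log(sum_j exp(z_j)), with z = x - max(x).
    # Taking the log of the sum directly avoids passing an underflowed
    # (zero) softmax output to the log function.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([-1e4, -1e4, -1e4])
print(stable_log_softmax(x))  # [-1.0986 -1.0986 -1.0986], i.e. log(1/3)
```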
For the most part, we do not explicitly detail all of the numerical considerations
involved in implementing the various algorithms described in this book. Implementors
should keep numerical issues in mind when developing implementations. Many numerical
issues can be avoided by using Theano (Bergstra et al., 2010; Bastien et al., 2012), a
software package that automatically detects and stabilizes many common numerically
unstable expressions that arise in the context of deep learning.
4.2 Poor conditioning
Conditioning refers to how rapidly a function changes with respect to small changes
in its inputs. Functions that change rapidly when their inputs are perturbed slightly
can be problematic for scientific computation because rounding errors in the inputs can
result in large changes in the output.
Consider the function f(x) = A^{-1}x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is
\[
\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|,
\]
i.e. the ratio of the magnitude of the largest and smallest eigenvalue. When this number
is large, matrix inversion is particularly sensitive to error in the input.
Note that this is an intrinsic property of the matrix itself, not the result of rounding
error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors
when we multiply by the true matrix inverse. In practice, the error will be compounded
further by numerical errors in the inversion process itself.
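For illustration, a short NumPy sketch that computes this eigenvalue-based condition number (the example matrix is an arbitrary choice of ours):

```python
import numpy as np

def eig_condition_number(A):
    # Ratio of the largest to the smallest eigenvalue magnitude.
    mags = np.abs(np.linalg.eigvals(A))
    return mags.max() / mags.min()

# A nearly singular matrix has a very large condition number.
A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])
print(eig_condition_number(A))  # 10000.0
```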
4.3 Gradient-based optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers
to the task of either minimizing or maximizing some function f(x) by altering x. We
usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function.
When we are minimizing it, we may also call it the cost function, loss function, or error
function.
We often denote the value that minimizes or maximizes a function with a superscript *. For example, we might say x* = arg min f(x).
We assume the reader is already familiar with calculus, but provide a brief review
of how calculus concepts relate to optimization here.
The derivative of a function y = f(x) from R to R, denoted as f'(x) or dy/dx, gives the slope of f(x) at the point x. It quantifies how a small change in x gets scaled in order to obtain a corresponding change in y. The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f'(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an
example of this technique.
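A minimal sketch of gradient descent in one dimension, using our own toy function f(x) = x² rather than an example from the text:

```python
def f(x):
    return x ** 2          # toy objective with its minimum at x = 0

def f_prime(x):
    return 2.0 * x         # derivative of the toy objective

x = 3.0
epsilon = 0.1              # step size
for _ in range(100):
    # Move x a small step in the direction opposite the sign of f'(x).
    x = x - epsilon * f_prime(x)
print(x)                   # very close to 0, the minimum
```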
When f'(x) = 0, the derivative provides no information about which direction to move. Points where f'(x) = 0 are known as critical points or stationary points. A local
minimum is a point where f(x) is lower than at all neighboring points, so it is no longer
possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f (x) is higher than at all neighboring points, so it is not possible to increase f(x)
by making infinitesimal steps. Some critical points are neither maxima nor minima.
These are known as saddle points. See Fig. 4.2 for examples of each type of critical
point.
A point that obtains the absolute lowest value of f(x) is a global minimum. It
is possible for there to be only one global minimum or multiple global minima of the
function. It is also possible for there to be local minima that are not globally optimal. In
the context of deep learning, we optimize functions that may have many local minima
that are not optimal, and many saddle points surrounded by very flat regions. All
of this makes optimization very difficult, especially when the input to the function is
multidimensional. We therefore usually settle for finding a value of f that is very low,
but not necessarily minimal in any formal sense. See Fig. 4.3 for an example. The figure
caption also raises the question of whether local minima or saddle points and plateaus
are more to blame for the difficulties one may encounter in training deep networks, a
question that is discussed further in Chapter 8, in particular Section 8.1.2.
We often minimize functions that have multiple inputs: f : R^n → R. Note that for the concept of “minimization” to make sense, there must still be only one (scalar) output.
For these functions, we must make use of the concept of partial derivatives. The partial derivative ∂f(x)/∂x_i measures how f changes as only the variable x_i increases at
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the
function downhill to a minimum. This technique is called gradient descent.
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point
is a point with zero slope. Such a point can either be a local minimum, which is lower
than the neighboring points, a local maximum, which is higher than the neighboring
points, or a saddle point, which has neighbors that are both higher and lower than the
point itself. The situation in higher dimensions is qualitatively different, especially for
saddle points: see Figures 4.4 and 4.5.
Figure 4.3: Optimization algorithms may fail to find a global minimum when there
are multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long as
they correspond to sufficiently low values of the cost function. Optimizing MLPs was
believed to suffer from the presence of many local minima, but this idea is questioned
in recent work (Dauphin et al., 2014), with saddle points being considered as the more
serious issue.
point x. The gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, it is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that this is u^⊤ ∇_x f(x).
To minimize f, we would like to find the direction in which f decreases the fastest.
We can do this using the directional derivative:
\[
\min_{\mathbf{u}, \, \mathbf{u}^\top \mathbf{u} = 1} \mathbf{u}^\top \nabla_{\mathbf{x}} f(\mathbf{x})
= \min_{\mathbf{u}, \, \mathbf{u}^\top \mathbf{u} = 1} \|\mathbf{u}\|_2 \, \|\nabla_{\mathbf{x}} f(\mathbf{x})\|_2 \cos\theta,
\]
where θ is the angle between u and the gradient. Substituting in ||u||_2 = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when
u points in the opposite direction as the gradient. In other words, the gradient points
directly uphill, and the negative gradient points directly downhill. We can decrease f
by moving in the direction of the negative gradient. This is known as the method of
steepest descent or gradient descent.
Steepest descent proposes a new point
\[
\mathbf{x}' = \mathbf{x} - \epsilon \nabla_{\mathbf{x}} f(\mathbf{x}),
\]
where ε is the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate f(x − ε ∇_x f(x)) for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.
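A sketch of a single steepest-descent step combined with a simple line search over a handful of candidate step sizes (the candidate values and the quadratic test function are illustrative choices of ours):

```python
import numpy as np

def line_search_step(f, grad_f, x, candidate_steps):
    # Evaluate f(x - eps * grad) for several step sizes and keep the best.
    g = grad_f(x)
    trials = [x - eps * g for eps in candidate_steps]
    values = [f(x_new) for x_new in trials]
    return trials[int(np.argmin(values))]

f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2.0 * x
x = np.array([3.0, -2.0])
x = line_search_step(f, grad_f, x, [0.01, 0.1, 0.5])
print(x)  # [0. 0.] -- the step size 0.5 happens to land exactly on the minimum
```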
Steepest descent converges when every element of the gradient is zero (or, in practice,
very close to zero). In some cases, we may be able to avoid running this iterative
algorithm, and just jump directly to the critical point by solving the equation ∇_x f(x) = 0 for x.
Sometimes we need to find all of the partial derivatives of all of the elements of a
vector-valued function. The matrix containing all such partial derivatives is known as
a Jacobian matrix. Specifically, if we have a function f : R^m → R^n, then the Jacobian matrix J ∈ R^{n×m} of f is defined such that J_{i,j} = ∂f(x)_i/∂x_j.
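As a rough illustration, the Jacobian can be approximated numerically by finite differences; the helper below is our own generic sketch, not a routine from the text:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    # J[i, j] approximates the partial derivative of output i with
    # respect to input j, using forward finite differences.
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += h
        J[:, j] = (f(x_step) - y) / h
    return J

# Example: f(x) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(numerical_jacobian(f, np.array([2.0, 3.0])))
```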
We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, ∂²f/∂x_i∂x_j is the derivative with respect to x_i of the derivative of f with respect to x_j. Note that the order of differentiation can be swapped, so that ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i. In a single dimension, we can denote d²f/dx² by f''(x).
The second derivative tells us how the first derivative will change as we vary the
input. This means it can be useful for determining whether a critical point is a local
maximum, a local minimum, or a saddle point. Recall that at a critical point, f'(x) = 0.
When f''(x) > 0, this means that f'(x) increases as we move to the right, and f'(x) decreases as we move to the left. This means f'(x − ε) < 0 and f'(x + ε) > 0 for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f'(x) = 0 and f''(x) > 0, we can conclude that x is a local minimum. Similarly, when f'(x) = 0 and f''(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f''(x) = 0, the test is inconclusive. In this case x may be a saddle point, or part of a flat region.
In multiple dimensions, we need to examine all of the second derivatives of the
function. These derivatives can be collected together into a matrix called the Hessian
matrix. The Hessian matrix H(f)(x) is defined such that
\[
H(f)(\mathbf{x})_{i,j} = \frac{\partial^2}{\partial x_i \, \partial x_j} f(\mathbf{x}).
\]
Equivalently, the Hessian is the Jacobian of the gradient.
Anywhere that the second partial derivatives are continuous, the differential operators are commutative:
\[
\frac{\partial^2}{\partial x_i \, \partial x_j} f(\mathbf{x}) = \frac{\partial^2}{\partial x_j \, \partial x_i} f(\mathbf{x}).
\]
This implies that H_{i,j} = H_{j,i}, so the Hessian matrix is symmetric at such points (which
includes nearly all inputs to nearly all functions we encounter in deep learning). Be-
cause the Hessian matrix is real and symmetric, we can decompose it into a set of real
eigenvalues and an orthogonal basis of eigenvectors.
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where ∇_x f(x) = 0, we can examine
the eigenvalues of the Hessian to determine whether the critical point is a local maxi-
mum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local
maximum. In multiple dimensions, it is actually possible to find positive evidence of
saddle points in some cases. When at least one eigenvalue is positive and at least one
eigenvalue is negative, we know that x is a local maximum on one cross section of f
but a local minimum on another cross section. See Fig. 4.4 for an example. Finally,
the multidimensional second derivative test can be inconclusive, just like the univariate
version. The test is inconclusive whenever all of the non-zero eigenvalues have the same
sign, but at least one eigenvalue is zero. This is because the univariate second derivative
test is inconclusive in the cross section corresponding to the zero eigenvalue.
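A small sketch of this multidimensional second derivative test, classifying a critical point from the eigenvalues of its Hessian (the numerical tolerance is an implementation choice of ours):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    # Apply the second derivative test using the eigenvalues of a
    # (symmetric) Hessian evaluated at a critical point.
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"

# Hessian of f(x) = x_1^2 - x_2^2 at the origin (the saddle of Fig. 4.4).
print(classify_critical_point(np.array([[2.0, 0.0],
                                        [0.0, -2.0]])))  # saddle point
```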
The Hessian can also be useful for understanding the performance of gradient descent.
When the Hessian has a poor condition number, gradient descent performs poorly. This
is because in one direction, the derivative increases rapidly, while in another direction, it
Figure 4.4: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1² − x_2². Along the axis corresponding to x_1, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward. This direction is an eigenvector of the Hessian with a negative eigenvalue. The name “saddle point” derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. Note that in more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. See Section 8.1.2 for a longer discussion of saddle points in deep nets.
increases slowly. Gradient descent is unaware of this change in the derivative so it does
not know that it needs to explore preferentially in the direction where the derivative
remains negative for longer. See Fig. 4.5 for an example.
This issue can be resolved by using information from the Hessian matrix to guide
the search. The simplest method for doing so is known as Newton’s method. Newton’s
method is based on using a second-order Taylor series expansion to approximate f(x)
near some point x^{(0)}:
\[
f(\mathbf{x}) \approx f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^\top \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^{(0)})^\top H(f)(\mathbf{x}^{(0)}) \, (\mathbf{x} - \mathbf{x}^{(0)}).
\]
If we then solve for the critical point of this function, we obtain:
\[
\mathbf{x}^* = \mathbf{x}^{(0)} - H(f)(\mathbf{x}^{(0)})^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}).
\]
When the function can be locally approximated as quadratic, iteratively updating the
approximation and jumping to the minimum of the approximation can reach the critical
point much faster than gradient descent would. This is a useful property near a local
minimum, but it can be a harmful property near a saddle point that gradient descent
may be fortunate enough to avoid.
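A brief sketch of a single Newton step on a quadratic function, where the second-order approximation is exact and one step lands on the minimum (the particular A and b are our own example):

```python
import numpy as np

def newton_step(grad, hess, x0):
    # x = x0 - H(f)(x0)^{-1} grad f(x0); solving the linear system is
    # preferable to forming the inverse explicitly.
    return x0 - np.linalg.solve(hess(x0), grad(x0))

# f(x) = 0.5 x^T A x - b^T x has gradient A x - b and Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x = newton_step(grad, hess, np.zeros(2))
print(np.allclose(A @ x, b))  # True: a single step reaches the minimum
```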
Optimization algorithms such as gradient descent that use only the gradient are
called first-order optimization algorithms. Optimization algorithms such as Newton’s
method that also use the Hessian matrix are called second-order optimization algorithms
(Nocedal and Wright, 2006).
The optimization algorithms employed in most contexts in this book are applicable
to a wide variety of functions, but come with almost no guarantees. This is because the
family of functions used in deep learning is quite complicated. In many
other fields, the dominant approach to optimization is to design optimization algorithms
for a limited family of functions. Perhaps the most successful field of specialized op-
timization is convex optimization. Convex optimization algorithms are able to provide many more guarantees, but are applicable only to convex functions, i.e. functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points
and all of their local minima are necessarily global minima. However, most problems
in deep learning are difficult to express in terms of convex optimization. Convex opti-
mization is used only as a subroutine of some deep learning algorithms. Ideas from the
analysis of convex optimization algorithms can be useful for proving the convergence of
deep learning algorithms. However, in general, the importance of convex optimization is
greatly diminished in the context of deep learning. For more information about convex
optimization, see Boyd and Vandenberghe (2004).
4.4 Constrained optimization
Sometimes we do not wish to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for
values of x in some set S. This is known as constrained optimization. Points x that lie
within the set S are called feasible points in constrained optimization terminology.
Figure 4.5: Gradient descent fails to exploit the curvature information contained in the
Hessian matrix. Here we use gradient descent on a quadratic function whose Hessian
matrix has condition number 5 (curvature is 5 times larger in one direction than in some
other direction). The red lines indicate the path followed by gradient descent. This very
elongated quadratic function resembles a long canyon. Gradient descent wastes time
repeatedly descending canyon walls, because they are the steepest feature. Because
the step size is somewhat too large, it has a tendency to overshoot the bottom of the
function and thus needs to descend the opposite canyon wall on the next iteration.
The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed
in this direction indicates that this directional derivative is rapidly increasing, so an
optimization algorithm based on the Hessian could predict that the steepest direction is
not actually a promising search direction in this context. Note that some recent results
suggest that the above picture is not representative for deep highly non-linear networks.
See Section 8.1.3 for more on this subject.
One simple approach to constrained optimization is simply to modify gradient de-
scent to be aware of the constraint. If we use a small constant step size ε, we can make
gradient descent steps, then project the result back into S. If we use a line search (see
previous section), we can search only over step sizes that yield new x points that are
feasible, or we can project each point on the line back into the constraint region.
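A sketch of the projection idea, here constraining x to the unit L² ball (the projection rule and the toy objective are our own illustrative choices):

```python
import numpy as np

def project_to_unit_ball(x):
    # Project onto S = {x : ||x||_2 <= 1} by rescaling if necessary.
    norm = np.linalg.norm(x)
    return x / norm if norm > 1.0 else x

# Minimize f(x) = ||x - c||^2 subject to ||x||_2 <= 1, with c outside S.
c = np.array([2.0, 2.0])
grad_f = lambda x: 2.0 * (x - c)
x = np.zeros(2)
epsilon = 0.1
for _ in range(200):
    # Take a gradient step, then project the result back into S.
    x = project_to_unit_ball(x - epsilon * grad_f(x))
print(x, np.linalg.norm(x))  # roughly [0.707, 0.707] with norm 1
```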
A more sophisticated approach is to design a different, unconstrained optimization
problem whose solution can be converted into a solution to the original, constrained
optimization problem. For example, if we want to minimize f(x) for x ∈ R² with x constrained to have exactly unit L² norm, we can instead minimize g(θ) = f([cos θ, sin θ]^⊤) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This
approach requires creativity; the transformation between optimization problems must
be designed specifically for each case we encounter.
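A sketch of this reparameterization for a toy objective (the target point c and step size are our own choices):

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 over the unit circle by reparameterizing
# x = [cos(theta), sin(theta)] and running unconstrained descent on theta.
c = np.array([1.0, 2.0])
theta = 0.0
epsilon = 0.1
for _ in range(200):
    x = np.array([np.cos(theta), np.sin(theta)])
    # dg/dtheta via the chain rule: grad f(x) . dx/dtheta
    dg = 2.0 * (x - c) @ np.array([-np.sin(theta), np.cos(theta)])
    theta = theta - epsilon * dg
print(np.array([np.cos(theta), np.sin(theta)]))  # roughly c / ||c||
```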
The Karush–Kuhn–Tucker (KKT) approach, which generalizes the method of Lagrange multipliers to allow inequality as well as equality constraints, provides a very general solution to
constrained optimization. With the KKT approach, we introduce a new function called
the generalized Lagrangian or generalized Lagrange function.
To define the Lagrangian, we first need to describe S in terms of equations and
inequalities. We want a description of S in terms of m functions g_i and n functions h_j so that S = {x | ∀i, g_i(x) = 0 and ∀j, h_j(x) ≤ 0}. The equations involving g_i are called the equality constraints and the inequalities involving h_j are called the inequality constraints.
We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
\[
L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \alpha_j h_j(\mathbf{x}).
\]
We can now solve a constrained minimization problem using unconstrained opti-
mization of the generalized Lagrangian. Observe that, so long as at least one feasible
point exists and f(x) is not permitted to have the value ∞, then
\[
\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})
\]
has the same optimal objective function value and set of optimal points x as
\[
\min_{\mathbf{x}} f(\mathbf{x}).
\]
This follows because any time the constraints are satisfied,
\[
\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\mathbf{x}),
\]
while any time a constraint is violated,
\[
\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \infty.
\]
These properties guarantee that no infeasible point will ever be optimal, and that the
optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:
\[
\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} -f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \alpha_j h_j(\mathbf{x}).
\]
We may also convert this to a problem with maximization in the outer loop:
\[
\max_{\mathbf{x}} \min_{\boldsymbol{\lambda}} \min_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) - \sum_j \alpha_j h_j(\mathbf{x}).
\]
Note that the sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization over λ is free to choose any sign for each λ_i.
The inequality constraints are particularly interesting. We say that a constraint
h_i(x) is active if h_i(x*) = 0. If a constraint is not active, then the solution to the problem is the same whether or not that constraint exists. Because an inactive h_i has negative value, the solution to min_x max_λ max_{α, α≥0} L(x, λ, α) will have α_i = 0. We can thus observe that at the solution, α ⊙ h(x) = 0. In other words, for all i, we know that at least one of the constraints α_i ≥ 0 and h_i(x) ≤ 0 must be active at the
solution. To gain some intuition for this idea, we can say that either the solution is on
the boundary imposed by the inequality and we must use its KKT multiplier to influence
the solution to x, or the inequality has no influence on the solution and we represent
this by zeroing out its KKT multiplier.
The properties that the gradient of the generalized Lagrangian is zero, all constraints
on both x and the KKT multipliers are satisfied, and α ⊙ h(x) = 0 are called the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together,
these properties describe the optimal points of constrained optimization problems.
In the case where there are no inequality constraints, the KKT approach simplifies
to the method of Lagrange multipliers. For more information about the KKT approach,
see Nocedal and Wright (2006).
4.5 Example: linear least squares
Suppose we want to find the value of x that minimizes
\[
f(\mathbf{x}) = \frac{1}{2} \| A\mathbf{x} - \mathbf{b} \|_2^2.
\]
There are specialized linear algebra algorithms that can solve this problem efficiently.
However, we can also explore how to solve it using gradient-based optimization as a
simple example of how these techniques work.
First, we need to obtain the gradient:
\[
\nabla_{\mathbf{x}} f(\mathbf{x}) = A^\top (A\mathbf{x} - \mathbf{b}) = A^\top A \mathbf{x} - A^\top \mathbf{b}.
\]
We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for
details.
Algorithm 4.1 An algorithm to minimize f(x) = (1/2)||Ax − b||²_2 with respect to x using gradient descent.

Set ε, the step size, and δ, the tolerance, to small, positive numbers.
while ||A^⊤Ax − A^⊤b||_2 > δ do
    x ← x − ε (A^⊤Ax − A^⊤b)
end while
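A direct NumPy transcription of Algorithm 4.1 might look like the following sketch (the particular A, b, step size, and tolerance are example values of ours):

```python
import numpy as np

def least_squares_gradient_descent(A, b, epsilon=0.01, delta=1e-6):
    # Follow the gradient A^T A x - A^T b downhill until it nearly vanishes.
    x = np.zeros(A.shape[1])
    grad = A.T @ A @ x - A.T @ b
    while np.linalg.norm(grad) > delta:
        x = x - epsilon * grad
        grad = A.T @ A @ x - A.T @ b
    return x

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
x = least_squares_gradient_descent(A, b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```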
One can also solve this problem using Newton’s method. In this case, because the
true function is quadratic, the quadratic approximation employed by Newton’s method
is exact, and the algorithm converges to the global minimum in a single step.
Now suppose we wish to minimize the same function, but subject to the constraint
x^⊤x ≤ 1. To do so, we introduce the Lagrangian
\[
L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda \left( \mathbf{x}^\top \mathbf{x} - 1 \right).
\]
We can now solve the problem
\[
\min_{\mathbf{x}} \max_{\lambda, \, \lambda \geq 0} L(\mathbf{x}, \lambda).
\]
The solution to the unconstrained least squares problem is given by x = (A^⊤A)^{-1}A^⊤b. If
this point is feasible, then it is the solution to the constrained problem. Otherwise, we
must find a solution where the constraint is active. By differentiating the Lagrangian
with respect to x, we obtain the equation
\[
A^\top A \mathbf{x} - A^\top \mathbf{b} + 2\lambda \mathbf{x} = 0.
\]
This tells us that the solution will take the form
\[
\mathbf{x} = (A^\top A + 2\lambda I)^{-1} A^\top \mathbf{b}.
\]
The magnitude of λ must be chosen such that the result obeys the constraint. We can
find this value by performing gradient ascent on λ. To do so, observe
\[
\frac{\partial}{\partial \lambda} L(\mathbf{x}, \lambda) = \mathbf{x}^\top \mathbf{x} - 1.
\]
When the norm of x exceeds 1, this derivative is positive, so to ascend the gradient and
increase the Lagrangian with respect to λ, we increase λ. This will in turn shrink the
optimal x. The process continues until x has the correct norm and the derivative on λ
is 0.
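A sketch of this procedure, alternating a closed-form solve for x with a projected gradient ascent step on λ (the example A, b, and learning rate are our own choices):

```python
import numpy as np

def constrained_least_squares(A, b, lr=0.1, n_iters=1000):
    # Alternate a closed-form solve of (A^T A + 2*lam*I) x = A^T b for x
    # with a gradient ascent step on lam using dL/dlam = x^T x - 1,
    # clipping lam to stay non-negative.
    n = A.shape[1]
    lam = 0.0
    for _ in range(n_iters):
        x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)
        lam = max(0.0, lam + lr * (x @ x - 1.0))
    return x, lam

A = np.eye(2)
b = np.array([3.0, 4.0])       # unconstrained solution has norm 5 > 1
x, lam = constrained_least_squares(A, b)
print(x, np.linalg.norm(x))    # roughly [0.6, 0.8] with norm 1
```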