Chapter 4
Numerical Computation
Machine learning algorithms usually require a high amount of numerical computation.
This refers to algorithms that solve mathematical problems by methods that iteratively
update estimates of the solution, rather than analytically deriving a symbolic expression
for the correct solution. Common operations include solving systems of linear equations
and finding the value of an argument that minimizes a function. Even just evaluating a
mathematical function on a digital computer can be difficult when the function involves
real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and underflow
The fundamental difficulty in performing continuous math on a digital computer is that
we need to represent infinitely many real numbers with a finite number of bit patterns.
This means that for almost all real numbers, we incur some approximation error when
we represent the number in the computer. In many cases, this is just rounding error.
Rounding error is problematic, especially when it compounds across many operations,
and can cause algorithms that work in theory to fail in practice if they are not designed
to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow. Underflow
occurs when numbers near zero are rounded to zero. If the result is later used as the
denominator of a fraction, the computer encounters a divide-by-zero error rather than
performing a valid floating point operation.
Another highly damaging form of numerical error is overflow. Overflow occurs when
numbers with large magnitude are approximated as ∞ or −∞. No more valid arithmetic
can be done at this point.
For an example of the need to design software implementations to deal with overflow
and underflow, consider the softmax function:
$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.$$
Consider what happens when all of the $x_i$ are equal to some constant $c$. Analytically, we can see that all of the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude. If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating $\text{softmax}(z)$ where $z = x - \max_i x_i$.
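This stabilization is straightforward to express in code. The following NumPy sketch (the function name is ours, not from any particular library) subtracts the maximum entry before exponentiating, so the largest argument passed to exp is 0:

```python
import numpy as np

def softmax_stable(x):
    """Numerically stable softmax: subtract max(x) so exp never overflows."""
    z = x - np.max(x)        # the largest entry of z is 0, so exp(z) <= 1
    e = np.exp(z)
    return e / np.sum(e)     # the denominator is at least 1, so it cannot underflow to 0

# With very large (or very negative) inputs the naive formula breaks down,
# but the stabilized version still returns [1/3, 1/3, 1/3] here.
print(softmax_stable(np.array([1000.0, 1000.0, 1000.0])))
```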
For the most part, we do not explicitly detail all of the numerical considerations
involved in implementing the various algorithms described in this book. Implementors
should keep numerical issues in mind when developing implementations. Many numerical
issues can be avoided by using Theano (Bergstra et al., 2010; Bastien et al., 2012), a
software package that automatically detects and stabilizes many common numerically
unstable expressions that arise in the context of deep learning.
4.2 Poor conditioning
Conditioning refers to how rapidly a function changes with respect to small changes
in its inputs. Functions that change rapidly when their inputs are perturbed slightly
can be problematic for scientific computation because rounding errors in the inputs can
result in large changes in the output.
Consider the function $f(x) = A^{-1} x$. When $A \in \mathbb{R}^{n \times n}$ has an eigenvalue decomposition, its condition number is
$$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|,$$
i.e. the ratio of the magnitude of the largest and smallest eigenvalue. When this number
is large, matrix inversion is particularly sensitive to error in the input.
Note that this is an intrinsic property of the matrix itself, not the result of rounding
error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors
when we multiply by the true matrix inverse. In practice, the error will be compounded
further by numerical errors in the inversion process itself.
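As a rough numerical illustration (our own sketch; the example matrix is chosen by hand), we can compute a condition number with NumPy and observe how it governs the amplification of input error when solving a linear system:

```python
import numpy as np

# A poorly conditioned matrix: eigenvalues 1 and 1e-6, condition number 1e6.
A = np.diag([1.0, 1e-6])
print(np.linalg.cond(A))                      # approximately 1e6

# Solve Ax = b, then perturb b slightly and solve again.
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)                     # x = [1, 0]

b_perturbed = b + np.array([0.0, 1e-8])       # relative error in b of about 1e-8
x_perturbed = np.linalg.solve(A, b_perturbed)

# The relative error in the solution is amplified by roughly the condition number.
print(np.linalg.norm(x_perturbed - x) / np.linalg.norm(x))   # about 1e-2
```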
4.3 Gradient-based optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers
to the task of either minimizing or maximizing some function f(x) by altering x. We
usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function.
When we are minimizing it, we may also call it the cost function, loss function, or error
function.
We often denote the value that minimizes or maximizes a function with a superscript $*$. For example, we might say $x^* = \arg\min f(x)$.
We assume the reader is already familiar with calculus, but provide a brief review
of how calculus concepts relate to optimization here.
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the
function downhill to a minimum. This technique is called gradient descent.
The derivative of a function $y = f(x)$ from $\mathbb{R}$ to $\mathbb{R}$, denoted $f'(x)$ or $\frac{dy}{dx}$, gives the slope of $f(x)$ at the point $x$. It quantifies how a small change in $x$ gets scaled in order to obtain a corresponding change in $y$. The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x - \epsilon \, \mathrm{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. See Fig. 4.1 for an example of this technique.
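As a tiny numerical check of this idea (an illustrative example of ours, using $f(x) = x^2$), a small step against the sign of the derivative does decrease the function:

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x, eps = 3.0, 0.1
step = 1.0 if f_prime(x) > 0 else -1.0    # sign of the derivative
x_new = x - eps * step                     # move opposite the sign of f'(x)
print(f(x_new) < f(x))                     # True: the step decreased f
```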
When $f'(x) = 0$, the derivative provides no information about which direction to move. Points where $f'(x) = 0$ are known as critical points or stationary points. A local
minimum is a point where f(x) is lower than at all neighboring points, so it is no longer
possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f (x) is higher than at all neighboring points, so it is not possible to increase f(x)
by making infinitesimal steps. Some critical points are neither maxima nor minima.
These are known as saddle points. See Fig. 4.2 for examples of each type of critical
point.
A point that obtains the absolute lowest value of f(x) is a global minimum. It
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point
is a point with zero slope. Such a point can either be a local minimum, which is lower
than the neighboring points, a local maximum, which is higher than the neighboring
points, or a saddle point, which has neighbors that are both higher and lower than the
point itself. The situation in higher dimensions is qualitatively different, especially for
saddle points: see Figures 4.4 and 4.5.
is possible for there to be only one global minimum or multiple global minima of the
function. It is also possible for there to be local minima that are not globally optimal. In
the context of deep learning, we optimize functions that may have many local minima
that are not optimal, and many saddle points surrounded by very flat regions. All
of this makes optimization very difficult, especially when the input to the function is
multidimensional. We therefore usually settle for finding a value of f that is very low,
but not necessarily minimal in any formal sense. See Fig. 4.3 for an example. The figure
caption also raises the question of whether local minima or saddle points and plateaus
are more to blame for the difficulties one may encounter in training deep networks, a
question that is discussed further in Chapter 8, in particular Section 8.4.
We often minimize functions that have multiple inputs: $f : \mathbb{R}^n \to \mathbb{R}$. Note that for the concept of “minimization” to make sense, there must still be only one output.
For these functions, we must make use of the concept of partial derivatives. The partial derivative $\frac{\partial}{\partial x_i} f(x)$ measures how $f$ changes as only the variable $x_i$ increases at point $x$. The gradient of $f$ is the vector containing all of the partial derivatives, denoted $\nabla_x f(x)$. Element $i$ of the gradient is the partial derivative of $f$ with respect to $x_i$. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
The directional derivative in direction $u$ (a unit vector) is the slope of the function $f$ in direction $u$. In other words, it is the derivative of the function $f(x + \alpha u)$ with respect to $\alpha$, evaluated at $\alpha = 0$. Using the chain rule, we can see that this is $u^\top \nabla_x f(x)$.
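To make this identity concrete, the following small check (our own example, not part of the text) compares $u^\top \nabla_x f(x)$ against a finite-difference estimate of the derivative of $f(x + \alpha u)$ at $\alpha = 0$:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)          # f(x) = ||x||^2, so the gradient is 2x

def grad_f(x):
    return 2 * x

x = np.array([1.0, -2.0, 0.5])
u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)    # a unit vector

analytic = u @ grad_f(x)                         # u^T grad f(x)
alpha = 1e-6
numeric = (f(x + alpha * u) - f(x)) / alpha      # finite-difference estimate

print(analytic, numeric)                         # the two values agree closely
```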
To minimize f, we would like to find the direction in which f decreases the fastest.
We can do this using the directional derivative:
$$\min_{u, u^\top u = 1} u^\top \nabla_x f(x) = \min_{u, u^\top u = 1} \|u\|_2 \, \|\nabla_x f(x)\|_2 \cos\theta$$
where $\theta$ is the angle between $u$ and the gradient. Substituting in $\|u\|_2 = 1$ and ignoring factors that do not depend on $u$, this simplifies to $\min_u \cos\theta$. This is minimized when $u$ points in the opposite direction as the gradient. In other words, the gradient points
directly uphill, and the negative gradient points directly downhill. We can decrease f
by moving in the direction of the negative gradient. This is known as the method of
steepest descent or gradient descent.
Steepest descent proposes a new point
$$x' = x - \epsilon \nabla_x f(x)$$
where $\epsilon$ is the size of the step. We can choose $\epsilon$ in several different ways. A popular approach is to set $\epsilon$ to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value. This last strategy is called a line search.
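A very simple line search can be sketched in a few lines. The version below is a minimal illustration of ours (practical implementations typically use more careful strategies, such as backtracking with a sufficient-decrease condition):

```python
import numpy as np

def line_search_step(f, grad_f, x, step_sizes=(1.0, 0.3, 0.1, 0.03, 0.01)):
    """Try several step sizes along the negative gradient and keep the best one."""
    g = grad_f(x)
    candidates = [x - eps * g for eps in step_sizes]
    return min(candidates, key=f)

# Example on a simple quadratic bowl.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
x = np.array([4.0, -3.0])
for _ in range(10):
    x = line_search_step(f, grad_f, x)
print(x)   # close to the minimum at the origin
```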
Steepest descent converges when every element of the gradient is zero (or, in practice,
very close to zero). In some cases, we may be able to avoid running this iterative
Figure 4.3: Optimization algorithms may fail to find a global minimum when there
are multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long as
they correspond to significantly low values of the cost function. Optimizing MLPs was
believed to suffer from the presence of many local minima, but this idea is questioned
in recent work (Dauphin et al., 2014), with saddle points being considered as the more
serious issue.
algorithm, and just jump directly to the critical point by solving the equation $\nabla_x f(x) = 0$ for $x$.
Sometimes we need to find all of the partial derivatives of all of the elements of a
vector-valued function. The matrix containing all such partial derivatives is known as
a Jacobian matrix. Specifically, if we have a function $f : \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $J \in \mathbb{R}^{n \times m}$ of $f$ is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$.
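For intuition, the Jacobian can be approximated column by column with finite differences. The sketch below (our own helper, useful mainly for checking analytic derivatives) fills column $j$ with the partial derivatives of every output with respect to $x_j$:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference approximation of the Jacobian of f: R^m -> R^n at x."""
    y = f(x)
    J = np.zeros((y.size, x.size))          # shape n x m, as in the definition above
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - y) / eps
    return J

# Example: f(x) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(jacobian_fd(f, np.array([2.0, 3.0])))
```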
We are also sometimes interested in a derivative of a derivative. This is known
as a second derivative. For example, $\frac{\partial^2}{\partial x_i \partial x_j} f$ is the derivative with respect to $x_i$ of the derivative of $f$ with respect to $x_j$. Note that the order of differentiation can be swapped (provided the second partial derivatives are continuous), so that $\frac{\partial^2}{\partial x_i \partial x_j} f = \frac{\partial^2}{\partial x_j \partial x_i} f$. In a single dimension, we can denote $\frac{d^2}{dx^2} f$ by $f''(x)$.
The second derivative tells us how the first derivative will change as we vary the
input. This means it can be useful for determining whether a critical point is a local
maximum, a local minimum, or a saddle point. Recall that at a critical point, $f'(x) = 0$. When $f''(x) > 0$, this means that $f'(x)$ increases as we move to the right, and $f'(x)$ decreases as we move to the left. This means $f'(x - \epsilon) < 0$ and $f'(x + \epsilon) > 0$ for small enough $\epsilon$. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when $f'(x) = 0$ and $f''(x) > 0$, we can conclude that $x$ is a local minimum. Similarly, when $f'(x) = 0$ and $f''(x) < 0$, we can conclude that $x$ is a local maximum. This is known as the second derivative test. Unfortunately, when $f''(x) = 0$, the test is inconclusive. In this case $x$ may be a saddle point, or a part of a flat region.
In multiple dimensions, we need to examine all of the second derivatives of the
function. These derivatives can be collected together into a matrix called the Hessian
matrix. The Hessian matrix $H(f)(x)$ is defined such that
$$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x).$$
Equivalently, the Hessian is the Jacobian of the gradient. Using the Hessian matrix,
we can generalize the second derivative test to multiple dimensions. At a critical point,
where $\nabla_x f(x) = 0$, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that $x$ is a local maximum on one cross section of $f$ but a local minimum on another cross section. See Fig. 4.4 for an example. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the non-zero eigenvalues have
the same sign, but at least one eigenvalue is zero. This is because the univariate second
derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
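The multidimensional test is easy to carry out numerically. The sketch below (our own helper; the example Hessian is that of $f(x) = x_0^2 - x_1^2$ at the origin, as in Fig. 4.4) classifies a critical point from the eigenvalues of its Hessian:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive"     # some eigenvalues are (numerically) zero

# Hessian of f(x) = x0^2 - x1^2, which has a saddle point at the origin.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))   # "saddle point"
```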
The Hessian can also be useful for understanding the performance of gradient descent.
When the Hessian has a poor condition number, gradient descent performs poorly. This
is because in one direction, the derivative increases rapidly, while in another direction, it
increases slowly. Gradient descent is unaware of this change in the derivative so it does
not know that it needs to explore preferentially in the direction where the derivative
remains negative for longer. See Fig. 4.5 for an example.
This issue can be resolved by using information from the Hessian matrix to guide
the search. The simplest method for doing so is known as Newton’s method. Newton’s
method is based on using a second-order Taylor series expansion to approximate $f(x)$ near some point $x_0$:
$$f(x) \approx f(x_0) + (x - x_0)^\top \nabla_x f(x_0) + \frac{1}{2} (x - x_0)^\top H(f)(x_0) (x - x_0).$$
If we then solve for the critical point of this function, we obtain:
$$x^* = x_0 - H(f)(x_0)^{-1} \nabla_x f(x_0).$$
When the function can be locally approximated as quadratic, iteratively updating the
approximation and jumping to the minimum of the approximation can reach the critical
point much faster than gradient descent would. This is a useful property near a local
minimum, but it can be a harmful property near a saddle point that gradient descent
may be fortunate enough to avoid.
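On a quadratic function, a single Newton update as written above jumps directly to the critical point. The following sketch (our own example, with a hand-coded gradient and Hessian) applies one Newton step to $f(x) = \frac{1}{2} x^\top A x - b^\top x$:

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])           # symmetric positive definite
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b                 # gradient of f(x) = 0.5 x^T A x - b^T x

def hessian(x):
    return A                         # the Hessian of a quadratic is constant

x0 = np.array([10.0, -10.0])
x_new = x0 - np.linalg.solve(hessian(x0), grad(x0))   # one Newton step

print(x_new)                         # equals A^{-1} b, the exact minimizer
print(np.linalg.norm(grad(x_new)))   # the gradient here is (numerically) zero
```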
Optimization algorithms such as gradient descent that use only the gradient are
called first-order optimization algorithms. Optimization algorithms such as Newton’s
method that also use the Hessian matrix are called second-order optimization algorithms.
The optimization algorithms employed in most contexts in this book are applicable
to a wide variety of functions, but come with almost no guarantees. This is because the
family of functions used in deep learning is quite complicated. In many
other fields, the dominant approach to optimization is to design optimization algorithms
for a limited family of functions. Perhaps the most successful field of specialized op-
timization is convex optimization. Convex optimization algorithms are able to provide many more guarantees, but are applicable only to functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points
and all of their local minima are necessarily global minima. However, most problems
in deep learning are difficult to express in terms of convex optimization. Convex opti-
mization is used only as a subroutine of some deep learning algorithms. Ideas from the
analysis of convex optimization algorithms can be useful for proving the convergence of
deep learning algorithms. However, in general, the importance of convex optimization is
greatly diminished in the context of deep learning. For more information about convex
optimization, see Boyd and Vandenberghe (2004).
Figure 4.4: A saddle point containing both positive and negative curvature. The func-
tion in this example is $f(x) = x_0^2 - x_1^2$. Along axis 0, the function curves upward (positive
eigenvalue), but along axis 1, the function curves downward (negative eigenvalue). In-
deed, the name “saddle point” derives from the saddle-like shape of this function. This is
the quintessential example of a function with a saddle point. Note how in more than one
dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point:
it is only necessary to have both positive and negative eigenvalues. See Section 8.4 for
a longer discussion of saddle points in deep nets.
Figure 4.5: Gradient descent fails to exploit the curvature information contained in the
Hessian matrix. Here we use gradient descent on a quadratic function whose Hessian
matrix has condition number 5 (curvature is 5 times larger in one direction than in some
other direction). The red lines indicate the path followed by gradient descent. This very
elongated quadratic function resembles a long canyon. Gradient descent wastes time
repeatedly descending canyon walls, because they are the steepest feature. Because
the step size is somewhat too large, it has a tendency to overshoot the bottom of the
function and thus needs to descend the opposite canyon wall on the next iteration.
The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed
in this direction indicates that this directional derivative is rapidly increasing, so an
optimization algorithm based on the Hessian could predict that the steepest direction is
not actually a promising search direction in this context. Note that some recent results
suggest that the above picture is not representative for deep highly non-linear networks.
See Section 8.5 for more on this subject.
4.4 Constrained optimization
Sometimes we wish not to maximize or minimize a function f(x) over all possible values of x, but instead to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie
within the set S are called feasible points in constrained optimization terminology.
One simple approach to constrained optimization is simply to modify gradient de-
scent to be aware of the constraint. If we use a small constant step size $\epsilon$, we can make gradient descent steps, then project the result back into S. If we use a line search, we
can search only over step sizes that yield new x points that are feasible.
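As a concrete sketch of the projection idea (our own minimal example; the constraint set and objective are chosen for illustration), the code below minimizes a quadratic subject to $\|x\|_2 \leq 1$ by alternating gradient steps with projection back onto the feasible set:

```python
import numpy as np

def project_unit_ball(x):
    """Project x onto the set S = {x : ||x||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

f = lambda x: np.sum((x - np.array([2.0, 2.0])) ** 2)    # unconstrained minimum lies outside S
grad_f = lambda x: 2 * (x - np.array([2.0, 2.0]))

x, eps = np.zeros(2), 0.1
for _ in range(100):
    x = project_unit_ball(x - eps * grad_f(x))            # gradient step, then project into S

print(x, np.linalg.norm(x))   # converges to the boundary point [1/sqrt(2), 1/sqrt(2)]
```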
A more sophisticated approach is to design a different, unconstrained optimization
problem whose solution can be converted into a solution to the original, constrained
optimization problem. For example, if we want to minimize f(x) for x R
2
with x con-
strained to have exactly unit l
2
norm, we can instead minimize g(θ) = f([cos θ, sin θ]
T
)
with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This
approach requires creativity; the transformation between optimization problems must
be designed specifically for each case we encounter.
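The same trick is easy to try numerically. The sketch below (our own objective and optimizer settings) minimizes an example $f$ over the unit circle by running plain gradient descent on $\theta$:

```python
import numpy as np

f = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2       # an example objective on R^2

def g(theta):
    return f(np.array([np.cos(theta), np.sin(theta)]))     # f restricted to the unit circle

def g_prime(theta, eps=1e-6):
    return (g(theta + eps) - g(theta - eps)) / (2 * eps)   # finite-difference derivative

theta = 0.5
for _ in range(200):
    theta -= 0.05 * g_prime(theta)                          # unconstrained gradient descent on theta

x = np.array([np.cos(theta), np.sin(theta)])                # map back to the original problem
print(x, np.linalg.norm(x))                                 # a unit-norm solution
```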
Lagrange multipliers provide a very general solution to constrained optimization.
Using the method of Lagrange multipliers, we introduce a new function called the La-
grangian or Lagrange function. This function introduces new, extra variables, called
Lagrange multipliers.
To define the Lagrangian, we first need to describe S in terms of equations and
inequalities. We want a description of S in terms of $m$ functions $g_i$ and $n$ functions $h_j$ so that $S = \{x \mid \forall i, g_i(x) = 0 \text{ and } \forall j, h_j(x) \leq 0\}$. The equations involving $g_i$ are called the equality constraints and the inequalities involving $h_j$ are called inequality constraints.
We introduce new variables $\lambda_i$ and $\alpha_j$ for each constraint. The Lagrangian is then defined as
$$L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \alpha_j h_j(x).$$
We can now solve a constrained minimization problem using unconstrained opti-
mization of the Lagrangian. Observe that, so long as at least one feasible point exists
and $f(x)$ is not permitted to have value ∞, then
$$\min_x \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha)$$
has the same optimal objective function value and set of optimal points $x$ as
$$\min_x f(x).$$
This follows because any time the constraints are satisfied,
$$\max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha) = f(x),$$
while any time a constraint is violated,
$$\max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha) = \infty.$$
These properties guarantee that no infeasible point will ever be optimal, and that the
optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the Lagrange function of $-f(x)$, which leads to this optimization problem:
$$\min_x \max_\lambda \max_{\alpha, \alpha \geq 0} -f(x) + \sum_i \lambda_i g_i(x) + \sum_j \alpha_j h_j(x).$$
We may also convert this to a problem with maximization in the outer loop:
$$\max_x \min_\lambda \min_{\alpha, \alpha \geq 0} f(x) + \sum_i \lambda_i g_i(x) - \sum_j \alpha_j h_j(x).$$
Note that the sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization over $\lambda$ is free to choose any sign for each $\lambda_i$.
The inequality constraints are particularly interesting. We say that a constraint
$h_i(x)$ is active if $h_i(x^*) = 0$. If a constraint is not active, then the solution to the problem is the same whether or not that constraint exists. Because an inactive $h_i$ has negative value, the solution to $\min_x \max_\lambda \max_{\alpha, \alpha \geq 0} L(x, \lambda, \alpha)$ will have $\alpha_i = 0$. We can thus observe that at the solution, $\alpha \odot h(x) = 0$. In other words, for all $i$, we know that at least one of the constraints $\alpha_i \geq 0$ and $h_i(x) \leq 0$ must be active at the
solution. To gain some intuition for this idea, we can say that either the solution is
on the boundary imposed by the inequality and we must use its Lagrange multiplier to
influence the solution to x, or the inequality has no influence on the solution and we
represent this by zeroing out its Lagrange multiplier.
The properties that the gradient of the Lagrangian is zero, all constraints on both $x$ and the Lagrange multipliers are satisfied, and $\alpha \odot h(x) = 0$ are called the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together, these properties describe the optimal points of constrained optimization problems.
For more information about Lagrange multipliers, see Nocedal and Wright (2006).
4.5 Example: linear least squares
Suppose we want to find the value of x that minimizes
$$f(x) = \frac{1}{2} \|Ax - b\|_2^2.$$
There are specialized linear algebra algorithms that can solve this problem efficiently.
However, we can also explore how to solve it using gradient-based optimization as a
simple example of how these techniques work.
First, we need to obtain the gradient:
$$\nabla_x f(x) = A^\top (Ax - b) = A^\top A x - A^\top b.$$
Algorithm 4.1 An algorithm to minimize $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ with respect to $x$ using gradient descent.

Set $\epsilon$, the step size, and $\delta$, the tolerance, to small, positive numbers.
while $\|A^\top A x - A^\top b\|_2 > \delta$ do
  $x \leftarrow x - \epsilon \left( A^\top A x - A^\top b \right)$
end while
We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for
details.
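A direct translation of Algorithm 4.1 into NumPy might look as follows (a sketch; the step size and tolerance are chosen by hand and assume $A^\top A$ is well enough conditioned for plain gradient descent to converge):

```python
import numpy as np

def least_squares_gd(A, b, eps=0.01, delta=1e-6, max_iters=100000):
    """Minimize 0.5 * ||Ax - b||_2^2 by gradient descent, as in Algorithm 4.1."""
    x = np.zeros(A.shape[1])
    for _ in range(max_iters):
        gradient = A.T @ A @ x - A.T @ b
        if np.linalg.norm(gradient) <= delta:    # stop when the gradient is (nearly) zero
            break
        x = x - eps * gradient
    return x

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x = least_squares_gd(A, b)
print(x)
print(np.allclose(A.T @ A @ x, A.T @ b, atol=1e-4))   # x satisfies the normal equations
```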
One can also solve this problem using Newton’s method. In this case, because the
true function is quadratic, the quadratic approximation employed by Newton’s method
is exact, and the algorithm converges to the global minimum in a single step.
Now suppose we wish to minimize the same function, but subject to the constraint $x^\top x \leq 1$. To do so, we introduce the Lagrangian
$$L(x, \lambda) = f(x) + \lambda \left( x^\top x - 1 \right).$$
We can now solve the problem
$$\min_x \max_{\lambda, \lambda \geq 0} L(x, \lambda).$$
The solution to the unconstrained least squares problem is given by $x = (A^\top A)^{-1} A^\top b$ (assuming $A^\top A$ is invertible). If this point is
feasible, then it is the solution to the constrained problem. Otherwise, we must find a
solution where the constraint is active. By differentiating the Lagrangian with respect
to $x$, we obtain the equation
$$A^\top A x - A^\top b + 2\lambda x = 0.$$
This tells us that the solution will take the form
$$x = (A^\top A + 2\lambda I)^{-1} A^\top b.$$
The magnitude of λ must be chosen such that the result obeys the constraint. We can
find this value by performing gradient ascent on $\lambda$. To do so, observe
$$\frac{\partial}{\partial \lambda} L(x, \lambda) = x^\top x - 1.$$
When the norm of x exceeds 1, this derivative is positive, so to ascend the gradient and
increase the Lagrangian with respect to λ, we increase λ. This will in turn shrink the
optimal x. The process continues until x has the correct norm and the derivative on λ
is 0.
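This procedure can be sketched directly in code. The example below (our own small $A$ and $b$, with step sizes chosen for illustration) alternates solving for $x$ at the current $\lambda$ with a gradient ascent step on $\lambda$:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([4.0, 4.0, 4.0])       # chosen so the unconstrained solution violates x^T x <= 1

lam = 0.0
for _ in range(10000):
    # Minimize the Lagrangian over x for the current value of lambda.
    x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(2), A.T @ b)
    # Gradient ascent on lambda: the derivative of the Lagrangian is x^T x - 1.
    lam = max(0.0, lam + 0.1 * (x @ x - 1.0))

print(x, x @ x)   # x^T x is approximately 1: the constraint is active at the solution
```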