Chapter 4
Numerical Computation
Machine learning algorithms usually require a large amount of numerical computation.
This refers to algorithms that solve mathematical problems by methods that iteratively
update estimates of the solution, rather than analytically deriving a symbolic expression
for the correct solution. Common operations include solving systems of linear equations
and finding the value of an argument that minimizes a function. Even just evaluating a
mathematical function on a digital computer can be difficult when the function involves
real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and underflow
The fundamental difficulty in performing continuous math on a digital computer is that
we need to represent infinitely many real numbers with a finite number of bit patterns.
This means that for almost all real numbers, we incur some approximation error when
we represent the number in the computer. In many cases, this is just rounding error.
Rounding error is problematic, especially when it compounds across many operations,
and can cause algorithms that work in theory to fail in practice if they are not designed
to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow. Underflow
occurs when numbers near zero are rounded to zero. If the result is later used as the
denominator of a fraction, the computer encounters a divide-by-zero error rather than
performing a valid floating point operation.
Another highly damaging form of numerical error is overflow. Overflow occurs when
numbers with large magnitude are approximated as ∞ or −∞. No more valid arithmetic
can be done at this point.
For an example of the need to design software implementations to deal with overflow
and underflow, consider the softmax function:
\[
\mathrm{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.
\]
Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may not occur when c has large magnitude. If c is very negative, then exp(c) will underflow. This means
the denominator of the softmax will become 0, so the final result is undefined. When
c is very large and positive, exp(c) will overflow, again resulting in the expression as a
whole being undefined. Both of these difficulties can be resolved by instead evaluating
softmax(z) where z = x − max_i x_i. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
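As a concrete illustration, here is a minimal NumPy sketch of the stabilized softmax described above (the function name stable_softmax and the example input are our own choices, not from the text):

```python
import numpy as np

def stable_softmax(x):
    # Subtract the maximum entry so that the largest argument to exp is 0,
    # which rules out overflow. At least one denominator term is then
    # exp(0) = 1, so underflow cannot cause a division by zero.
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

# With a large-magnitude constant input, a naive implementation overflows
# or underflows, while the stabilized version returns the correct 1/n.
x = np.array([1e4, 1e4, 1e4])
print(stable_softmax(x))  # [0.333... 0.333... 0.333...]
```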
There is still one small problem. Underflow in the numerator can still cause the ex-
pression as a whole to evaluate to zero. This means that if we implement log softmax(x)
by first running the softmax subroutine then passing the result to the log function, we
could erroneously obtain −∞. Instead, we must implement a separate function that
calculates log softmax in a numerically stable way. The log softmax function can be
stabilized using the same trick as we used to stabilize the softmax function.
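A corresponding sketch of a numerically stable log softmax, applying the same max-subtraction trick directly inside the logarithm (again a hypothetical helper of our own, not a routine from the text):

```python
import numpy as np

def stable_log_softmax(x):
    # log softmax(x)_i = z_i - log(sum_j exp(z_j)), with z = x - max(x).
    # Taking the log of the sum directly avoids passing an underflowed
    # (zero) softmax output to the log function.
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([-1e4, -1e4, -1e4])
print(stable_log_softmax(x))  # [-1.0986 -1.0986 -1.0986], i.e. log(1/3)
```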
For the most part, we do not explicitly detail all of the numerical considerations
involved in implementing the various algorithms described in this book. Implementors
should keep numerical issues in mind when developing implementations. Many numerical
issues can be avoided by using Theano (Bergstra et al., 2010; Bastien et al., 2012), a
software package that automatically detects and stabilizes many common numerically
unstable expressions that arise in the context of deep learning.
4.2 Poor conditioning
Conditioning refers to how rapidly a function changes with respect to small changes
in its inputs. Functions that change rapidly when their inputs are perturbed slightly
can be problematic for scientific computation because rounding errors in the inputs can
result in large changes in the output.
Consider the function f(x) = A^{-1}x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is
\[
\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|,
\]
i.e. the ratio of the magnitude of the largest and smallest eigenvalue. When this number
is large, matrix inversion is particularly sensitive to error in the input.
Note that this is an intrinsic property of the matrix itself, not the result of rounding
error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors
when we multiply by the true matrix inverse. In practice, the error will be compounded
further by numerical errors in the inversion process itself.
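For illustration, a short NumPy sketch that computes this eigenvalue-based condition number (the example matrix is an arbitrary choice of ours):

```python
import numpy as np

def eig_condition_number(A):
    # Ratio of the largest to the smallest eigenvalue magnitude.
    mags = np.abs(np.linalg.eigvals(A))
    return mags.max() / mags.min()

# A nearly singular matrix has a very large condition number.
A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])
print(eig_condition_number(A))  # 10000.0
```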
4.3 Gradient-based optimization
Most deep learning algorithms involve optimization of some sort. Optimization refers
to the task of either minimizing or maximizing some function f(x) by altering x. We
usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function.
When we are minimizing it, we may also call it the cost function, loss function, or error
function.
We often denote the value that minimizes or maximizes a function with a superscript *. For example, we might say x* = arg min f(x).
We assume the reader is already familiar with calculus, but provide a brief review
of how calculus concepts relate to optimization here.
The derivative of a function y = f(x) from R to R, denoted as f'(x) or dy/dx, gives the slope of f(x) at the point x. It quantifies how a small change in x gets scaled in order to obtain a corresponding change in y. The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f'(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an
example of this technique.
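A minimal sketch of gradient descent in one dimension, using our own toy function f(x) = x² rather than an example from the text:

```python
def f(x):
    return x ** 2          # toy objective with its minimum at x = 0

def f_prime(x):
    return 2.0 * x         # derivative of the toy objective

x = 3.0
epsilon = 0.1              # step size
for _ in range(100):
    # Move x a small step in the direction opposite the sign of f'(x).
    x = x - epsilon * f_prime(x)
print(x)                   # very close to 0, the minimum
```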
When f'(x) = 0, the derivative provides no information about which direction to move. Points where f'(x) = 0 are known as critical points or stationary points. A local
minimum is a point where f(x) is lower than at all neighboring points, so it is no longer
possible to decrease f(x) by making infinitesimal steps. A local maximum is a point
where f (x) is higher than at all neighboring points, so it is not possible to increase f(x)
by making infinitesimal steps. Some critical points are neither maxima nor minima.
These are known as saddle points. See Fig. 4.2 for examples of each type of critical
point.
A point that obtains the absolute lowest value of f(x) is a global minimum. It
is possible for there to be only one global minimum or multiple global minima of the
function. It is also possible for there to be local minima that are not globally optimal. In
the context of deep learning, we optimize functions that may have many local minima
that are not optimal, and many saddle points surrounded by very flat regions. All
of this makes optimization very difficult, especially when the input to the function is
multidimensional. We therefore usually settle for finding a value of f that is very low,
but not necessarily minimal in any formal sense. See Fig. 4.3 for an example. The figure
caption also raises the question of whether local minima or saddle points and plateaus
are more to blame for the difficulties one may encounter in training deep networks, a
question that is discussed further in Chapter 8, in particular Section 8.1.2.
We often minimize functions that have multiple inputs: f : R^n → R. Note that for the concept of “minimization” to make sense, there must still be only one (scalar) output.
For these functions, we must make use of the concept of partial derivatives. The partial derivative ∂f(x)/∂x_i measures how f changes as only the variable x_i increases at
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the
function downhill to a minimum. This technique is called gradient descent.
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point
is a point with zero slope. Such a point can either be a local minimum, which is lower
than the neighboring points, a local maximum, which is higher than the neighboring
points, or a saddle point, which has neighbors that are both higher and lower than the
point itself. The situation in higher dimensions is qualitatively different, especially for
saddle points: see Figures 4.4 and 4.5.
Figure 4.3: Optimization algorithms may fail to find a global minimum when there
are multiple local minima or plateaus present. In the context of deep learning, we
generally accept such solutions even though they are not truly minimal, so long as
they correspond to sufficiently low values of the cost function. Optimizing MLPs was
believed to suffer from the presence of many local minima, but this idea is questioned
in recent work (Dauphin et al., 2014), with saddle points being considered as the more
serious issue.
point x. The gradient of f is the vector containing all of the partial derivatives, denoted ∇_x f(x). Element i of the gradient is the partial derivative of f with respect to x_i. In multiple dimensions, critical points are points where every element of the gradient is equal to zero.
The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, it is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that this is u^⊤ ∇_x f(x).
To minimize f, we would like to find the direction in which f decreases the fastest.
We can do this using the directional derivative:
\[
\min_{\mathbf{u}, \, \mathbf{u}^\top \mathbf{u} = 1} \mathbf{u}^\top \nabla_{\mathbf{x}} f(\mathbf{x})
= \min_{\mathbf{u}, \, \mathbf{u}^\top \mathbf{u} = 1} \|\mathbf{u}\|_2 \, \|\nabla_{\mathbf{x}} f(\mathbf{x})\|_2 \cos\theta,
\]
where θ is the angle between u and the gradient. Substituting in ||u||_2 = 1 and ignoring factors that do not depend on u, this simplifies to min_u cos θ. This is minimized when
u points in the opposite direction as the gradient. In other words, the gradient points
directly uphill, and the negative gradient points directly downhill. We can decrease f
by moving in the direction of the negative gradient. This is known as the method of
steepest descent or gradient descent.
Steepest descent proposes a new point
\[
\mathbf{x}' = \mathbf{x} - \epsilon \nabla_{\mathbf{x}} f(\mathbf{x}),
\]
where ε is the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate f(x − ε ∇_x f(x)) for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.
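A sketch of a single steepest-descent step combined with a simple line search over a handful of candidate step sizes (the candidate values and the quadratic test function are illustrative choices of ours):

```python
import numpy as np

def line_search_step(f, grad_f, x, candidate_steps):
    # Evaluate f(x - eps * grad) for several step sizes and keep the best.
    g = grad_f(x)
    trials = [x - eps * g for eps in candidate_steps]
    values = [f(x_new) for x_new in trials]
    return trials[int(np.argmin(values))]

f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2.0 * x
x = np.array([3.0, -2.0])
x = line_search_step(f, grad_f, x, [0.01, 0.1, 0.5])
print(x)  # [0. 0.] -- the step size 0.5 happens to land exactly on the minimum
```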
Steepest descent converges when every element of the gradient is zero (or, in practice,
very close to zero). In some cases, we may be able to avoid running this iterative
algorithm, and just jump directly to the critical point by solving the equation ∇_x f(x) = 0 for x.
Sometimes we need to find all of the partial derivatives of all of the elements of a
vector-valued function. The matrix containing all such partial derivatives is known as
a Jacobian matrix. Specifically, if we have a function f : R^m → R^n, then the Jacobian matrix J ∈ R^{n×m} of f is defined such that J_{i,j} = ∂f(x)_i/∂x_j.
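As a rough illustration, the Jacobian can be approximated numerically by finite differences; the helper below is our own generic sketch, not a routine from the text:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    # J[i, j] approximates the partial derivative of output i with
    # respect to input j, using forward finite differences.
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += h
        J[:, j] = (f(x_step) - y) / h
    return J

# Example: f(x) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(numerical_jacobian(f, np.array([2.0, 3.0])))
```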
We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, ∂²f/∂x_i∂x_j is the derivative with respect to x_i of the derivative of f with respect to x_j. Note that the order of differentiation can be swapped, so that ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i. In a single dimension, we can denote d²f/dx² by f''(x).
The second derivative tells us how the first derivative will change as we vary the
input. This means it can be useful for determining whether a critical point is a local
maximum, a local minimum, or a saddle point. Recall that at a critical point, f'(x) = 0.
When f''(x) > 0, this means that f'(x) increases as we move to the right, and f'(x) decreases as we move to the left. This means f'(x − ε) < 0 and f'(x + ε) > 0 for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f'(x) = 0 and f''(x) > 0, we can conclude that x is a local minimum. Similarly, when f'(x) = 0 and f''(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f''(x) = 0, the test is inconclusive. In this case x may be a saddle point, or part of a flat region.
In multiple dimensions, we need to examine all of the second derivatives of the
function. These derivatives can be collected together into a matrix called the Hessian
matrix. The Hessian matrix H(f)(x) is defined such that
\[
H(f)(\mathbf{x})_{i,j} = \frac{\partial^2}{\partial x_i \, \partial x_j} f(\mathbf{x}).
\]
Equivalently, the Hessian is the Jacobian of the gradient.
Anywhere that the second partial derivatives are continuous, the differential operators are commutative:
\[
\frac{\partial^2}{\partial x_i \, \partial x_j} f(\mathbf{x}) = \frac{\partial^2}{\partial x_j \, \partial x_i} f(\mathbf{x}).
\]
This implies that H_{i,j} = H_{j,i}, so the Hessian matrix is symmetric at such points (which
includes nearly all inputs to nearly all functions we encounter in deep learning). Be-
cause the Hessian matrix is real and symmetric, we can decompose it into a set of real
eigenvalues and an orthogonal basis of eigenvectors.
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where ∇_x f(x) = 0, we can examine
the eigenvalues of the Hessian to determine whether the critical point is a local maxi-
mum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local
maximum. In multiple dimensions, it is actually possible to find positive evidence of
saddle points in some cases. When at least one eigenvalue is positive and at least one
eigenvalue is negative, we know that x is a local maximum on one cross section of f
but a local minimum on another cross section. See Fig. 4.4 for an example. Finally,
the multidimensional second derivative test can be inconclusive, just like the univariate
version. The test is inconclusive whenever all of the non-zero eigenvalues have the same
sign, but at least one eigenvalue is zero. This is because the univariate second derivative
test is inconclusive in the cross section corresponding to the zero eigenvalue.
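A small sketch of this multidimensional second derivative test, classifying a critical point from the eigenvalues of its Hessian (the numerical tolerance is an implementation choice of ours):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    # Apply the second derivative test using the eigenvalues of a
    # (symmetric) Hessian evaluated at a critical point.
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive"

# Hessian of f(x) = x_1^2 - x_2^2 at the origin (the saddle of Fig. 4.4).
print(classify_critical_point(np.array([[2.0, 0.0],
                                        [0.0, -2.0]])))  # saddle point
```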
The Hessian can also be useful for understanding the performance of gradient descent.
When the Hessian has a poor condition number, gradient descent performs poorly. This
is because in one direction, the derivative increases rapidly, while in another direction, it
Figure 4.4: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1² − x_2². Along the axis corresponding to x_1, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward. This direction is an eigenvector of the Hessian with a negative eigenvalue. The name “saddle point” derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. Note that in more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. See Section 8.1.2 for a longer discussion of saddle points in deep nets.
increases slowly. Gradient descent is unaware of this change in the derivative so it does
not know that it needs to explore preferentially in the direction where the derivative
remains negative for longer. See Fig. 4.5 for an example.
This issue can be resolved by using information from the Hessian matrix to guide
the search. The simplest method for doing so is known as Newton’s method. Newton’s
method is based on using a second-order Taylor series expansion to approximate f(x)
near some point x^{(0)}:
\[
f(\mathbf{x}) \approx f(\mathbf{x}^{(0)}) + (\mathbf{x} - \mathbf{x}^{(0)})^\top \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^{(0)})^\top H(f)(\mathbf{x}^{(0)}) \, (\mathbf{x} - \mathbf{x}^{(0)}).
\]
If we then solve for the critical point of this function, we obtain:
\[
\mathbf{x}^* = \mathbf{x}^{(0)} - H(f)(\mathbf{x}^{(0)})^{-1} \nabla_{\mathbf{x}} f(\mathbf{x}^{(0)}).
\]
When the function can be locally approximated as quadratic, iteratively updating the
approximation and jumping to the minimum of the approximation can reach the critical
point much faster than gradient descent would. This is a useful property near a local
minimum, but it can be a harmful property near a saddle point that gradient descent
may be fortunate enough to avoid.
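A brief sketch of a single Newton step on a quadratic function, where the second-order approximation is exact and one step lands on the minimum (the particular A and b are our own example):

```python
import numpy as np

def newton_step(grad, hess, x0):
    # x = x0 - H(f)(x0)^{-1} grad f(x0); solving the linear system is
    # preferable to forming the inverse explicitly.
    return x0 - np.linalg.solve(hess(x0), grad(x0))

# f(x) = 0.5 x^T A x - b^T x has gradient A x - b and Hessian A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x = newton_step(grad, hess, np.zeros(2))
print(np.allclose(A @ x, b))  # True: a single step reaches the minimum
```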
Optimization algorithms such as gradient descent that use only the gradient are
called first-order optimization algorithms. Optimization algorithms such as Newton’s
method that also use the Hessian matrix are called second-order optimization algorithms
(Nocedal and Wright, 2006).
The optimization algorithms employed in most contexts in this book are applicable
to a wide variety of functions, but come with almost no guarantees. This is because the
family of functions used in deep learning is quite complicated. In many
other fields, the dominant approach to optimization is to design optimization algorithms
for a limited family of functions. Perhaps the most successful field of specialized op-
timization is convex optimization. Convex optimization algorithms are able to provide many more guarantees, but are applicable only to convex functions, i.e. functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points
and all of their local minima are necessarily global minima. However, most problems
in deep learning are difficult to express in terms of convex optimization. Convex opti-
mization is used only as a subroutine of some deep learning algorithms. Ideas from the
analysis of convex optimization algorithms can be useful for proving the convergence of
deep learning algorithms. However, in general, the importance of convex optimization is
greatly diminished in the context of deep learning. For more information about convex
optimization, see Boyd and Vandenberghe (2004).
4.4 Constrained optimization
Sometimes we do not wish to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for
values of x in some set S. This is known as constrained optimization. Points x that lie
within the set S are called feasible points in constrained optimization terminology.
Figure 4.5: Gradient descent fails to exploit the curvature information contained in the
Hessian matrix. Here we use gradient descent on a quadratic function whose Hessian
matrix has condition number 5 (curvature is 5 times larger in one direction than in some
other direction). The red lines indicate the path followed by gradient descent. This very
elongated quadratic function resembles a long canyon. Gradient descent wastes time
repeatedly descending canyon walls, because they are the steepest feature. Because
the step size is somewhat too large, it has a tendency to overshoot the bottom of the
function and thus needs to descend the opposite canyon wall on the next iteration.
The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed
in this direction indicates that this directional derivative is rapidly increasing, so an
optimization algorithm based on the Hessian could predict that the steepest direction is
not actually a promising search direction in this context. Note that some recent results
suggest that the above picture is not representative for deep highly non-linear networks.
See Section 8.1.3 for more on this subject.
One simple approach to constrained optimization is simply to modify gradient de-
scent to be aware of the constraint. If we use a small constant step size ε, we can make
gradient descent steps, then project the result back into S. If we use a line search (see
previous section), we can search only over step sizes that yield new x points that are
feasible, or we can project each point on the line back into the constraint region.
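A sketch of the projection idea, here constraining x to the unit L² ball (the projection rule and the toy objective are our own illustrative choices):

```python
import numpy as np

def project_to_unit_ball(x):
    # Project onto S = {x : ||x||_2 <= 1} by rescaling if necessary.
    norm = np.linalg.norm(x)
    return x / norm if norm > 1.0 else x

# Minimize f(x) = ||x - c||^2 subject to ||x||_2 <= 1, with c outside S.
c = np.array([2.0, 2.0])
grad_f = lambda x: 2.0 * (x - c)
x = np.zeros(2)
epsilon = 0.1
for _ in range(200):
    # Take a gradient step, then project the result back into S.
    x = project_to_unit_ball(x - epsilon * grad_f(x))
print(x, np.linalg.norm(x))  # roughly [0.707, 0.707] with norm 1
```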
A more sophisticated approach is to design a different, unconstrained optimization
problem whose solution can be converted into a solution to the original, constrained
optimization problem. For example, if we want to minimize f(x) for x ∈ R² with x constrained to have exactly unit L² norm, we can instead minimize g(θ) = f([cos θ, sin θ]^⊤) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This
approach requires creativity; the transformation between optimization problems must
be designed specifically for each case we encounter.
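A sketch of this reparameterization for a toy objective (the target point c and step size are our own choices):

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 over the unit circle by reparameterizing
# x = [cos(theta), sin(theta)] and running unconstrained descent on theta.
c = np.array([1.0, 2.0])
theta = 0.0
epsilon = 0.1
for _ in range(200):
    x = np.array([np.cos(theta), np.sin(theta)])
    # dg/dtheta via the chain rule: grad f(x) . dx/dtheta
    dg = 2.0 * (x - c) @ np.array([-np.sin(theta), np.cos(theta)])
    theta = theta - epsilon * dg
print(np.array([np.cos(theta), np.sin(theta)]))  # roughly c / ||c||
```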
The Karush–Kuhn–Tucker (KKT) approach, which generalizes the method of Lagrange multipliers to allow inequality as well as equality constraints, provides a very general solution to
constrained optimization. With the KKT approach, we introduce a new function called
the generalized Lagrangian or generalized Lagrange function.
To define the Lagrangian, we first need to describe S in terms of equations and
inequalities. We want a description of S in terms of m functions g_i and n functions h_j so that S = {x | ∀i, g_i(x) = 0 and ∀j, h_j(x) ≤ 0}. The equations involving g_i are called the equality constraints and the inequalities involving h_j are called the inequality constraints.
We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
\[
L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \alpha_j h_j(\mathbf{x}).
\]
We can now solve a constrained minimization problem using unconstrained opti-
mization of the generalized Lagrangian. Observe that, so long as at least one feasible
point exists and f(x) is not permitted to have the value ∞, then
\[
\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})
\]
has the same optimal objective function value and set of optimal points x as
\[
\min_{\mathbf{x}} f(\mathbf{x}).
\]
This follows because any time the constraints are satisfied,
\[
\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\mathbf{x}),
\]
while any time a constraint is violated,
\[
\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \infty.
\]
These properties guarantee that no infeasible point will ever be optimal, and that the
optimum within the feasible points is unchanged.
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:
\[
\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} -f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \alpha_j h_j(\mathbf{x}).
\]
We may also convert this to a problem with maximization in the outer loop:
\[
\max_{\mathbf{x}} \min_{\boldsymbol{\lambda}} \min_{\boldsymbol{\alpha}, \, \boldsymbol{\alpha} \geq 0} f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) - \sum_j \alpha_j h_j(\mathbf{x}).
\]
Note that the sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization over λ is free to choose any sign for each λ_i.
The inequality constraints are particularly interesting. We say that a constraint
h_i(x) is active if h_i(x*) = 0. If a constraint is not active, then the solution to the problem is the same whether or not that constraint exists. Because an inactive h_i has negative value, the solution to min_x max_λ max_{α, α≥0} L(x, λ, α) will have α_i = 0. We can thus observe that at the solution, α ⊙ h(x) = 0. In other words, for all i, we know that at least one of the constraints α_i ≥ 0 and h_i(x) ≤ 0 must be active at the
solution. To gain some intuition for this idea, we can say that either the solution is on
the boundary imposed by the inequality and we must use its KKT multiplier to influence
the solution to x, or the inequality has no influence on the solution and we represent
this by zeroing out its KKT multiplier.
The properties that the gradient of the generalized Lagrangian is zero, all constraints
on both x and the KKT multipliers are satisfied, and α ⊙ h(x) = 0 are called the Karush-Kuhn-Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together,
these properties describe the optimal points of constrained optimization problems.
In the case where there are no inequality constraints, the KKT approach simplifies
to the method of Lagrange multipliers. For more information about the KKT approach,
see Nocedal and Wright (2006).
4.5 Example: linear least squares
Suppose we want to find the value of x that minimizes
\[
f(\mathbf{x}) = \frac{1}{2} \| A\mathbf{x} - \mathbf{b} \|_2^2.
\]
There are specialized linear algebra algorithms that can solve this problem efficiently.
However, we can also explore how to solve it using gradient-based optimization as a
simple example of how these techniques work.
First, we need to obtain the gradient:
\[
\nabla_{\mathbf{x}} f(\mathbf{x}) = A^\top (A\mathbf{x} - \mathbf{b}) = A^\top A \mathbf{x} - A^\top \mathbf{b}.
\]
We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for
details.
Algorithm 4.1 An algorithm to minimize f(x) = (1/2)||Ax − b||²_2 with respect to x using gradient descent.

Set ε, the step size, and δ, the tolerance, to small, positive numbers.
while ||A^⊤Ax − A^⊤b||_2 > δ do
    x ← x − ε (A^⊤Ax − A^⊤b)
end while
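A direct NumPy transcription of Algorithm 4.1 might look like the following sketch (the particular A, b, step size, and tolerance are example values of ours):

```python
import numpy as np

def least_squares_gradient_descent(A, b, epsilon=0.01, delta=1e-6):
    # Follow the gradient A^T A x - A^T b downhill until it nearly vanishes.
    x = np.zeros(A.shape[1])
    grad = A.T @ A @ x - A.T @ b
    while np.linalg.norm(grad) > delta:
        x = x - epsilon * grad
        grad = A.T @ A @ x - A.T @ b
    return x

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 1.0])
x = least_squares_gradient_descent(A, b)
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```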
One can also solve this problem using Newton’s method. In this case, because the
true function is quadratic, the quadratic approximation employed by Newton’s method
is exact, and the algorithm converges to the global minimum in a single step.
Now suppose we wish to minimize the same function, but subject to the constraint
x^⊤x ≤ 1. To do so, we introduce the Lagrangian
\[
L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda \left( \mathbf{x}^\top \mathbf{x} - 1 \right).
\]
We can now solve the problem
\[
\min_{\mathbf{x}} \max_{\lambda, \, \lambda \geq 0} L(\mathbf{x}, \lambda).
\]
The solution to the unconstrained least squares problem is given by x = (A^⊤A)^{-1}A^⊤b. If
this point is feasible, then it is the solution to the constrained problem. Otherwise, we
must find a solution where the constraint is active. By differentiating the Lagrangian
with respect to x, we obtain the equation
\[
A^\top A \mathbf{x} - A^\top \mathbf{b} + 2\lambda \mathbf{x} = 0.
\]
This tells us that the solution will take the form
\[
\mathbf{x} = (A^\top A + 2\lambda I)^{-1} A^\top \mathbf{b}.
\]
The magnitude of λ must be chosen such that the result obeys the constraint. We can
find this value by performing gradient ascent on λ. To do so, observe
\[
\frac{\partial}{\partial \lambda} L(\mathbf{x}, \lambda) = \mathbf{x}^\top \mathbf{x} - 1.
\]
When the norm of x exceeds 1, this derivative is positive, so to ascend the gradient and
increase the Lagrangian with respect to λ, we increase λ. This will in turn shrink the
optimal x. The process continues until x has the correct norm and the derivative on λ
is 0.
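A sketch of this procedure, alternating a closed-form solve for x with a projected gradient ascent step on λ (the example A, b, and learning rate are our own choices):

```python
import numpy as np

def constrained_least_squares(A, b, lr=0.1, n_iters=1000):
    # Alternate a closed-form solve of (A^T A + 2*lam*I) x = A^T b for x
    # with a gradient ascent step on lam using dL/dlam = x^T x - 1,
    # clipping lam to stay non-negative.
    n = A.shape[1]
    lam = 0.0
    for _ in range(n_iters):
        x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)
        lam = max(0.0, lam + lr * (x @ x - 1.0))
    return x, lam

A = np.eye(2)
b = np.array([3.0, 4.0])       # unconstrained solution has norm 5 > 1
x, lam = constrained_least_squares(A, b)
print(x, np.linalg.norm(x))    # roughly [0.6, 0.8] with norm 1
```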