
CHAPTER 4. NUMERICAL COMPUTATION
Newton’s method that also use the Hessian matrix are called second-order opti-
mization algorithms (Nocedal and Wright, 2006).
The optimization algorithms employed in most contexts in this book are ap-
plicable to a wide variety of functions, but come with almost no guarantees. This
is because the family of functions used in deep learning is quite complicated. In
many other fields, the dominant approach to optimization is to design optimiza-
tion algorithms for a limited family of functions. Perhaps the most successful
field of specialized optimization is convex optimization. Convex optimization al-
gorithms are able to provide many more guarantees, but are applicable only to
functions for which the Hessian is positive semidefinite everywhere. Such functions
are well-behaved because they lack saddle points and all of their local minima are
necessarily global minima. However, most problems in deep learning are difficult
to express in terms of convex optimization. Convex optimization is used only as
a subroutine of some deep learning algorithms. Ideas from the analysis of convex
optimization algorithms can be useful for proving the convergence of deep learn-
ing algorithms. However, in general, the importance of convex optimization is
greatly diminished in the context of deep learning. For more information about
convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
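To make the guarantees of convexity concrete, the following sketch (our own illustration; the quadratic f, the matrix A, the starting point, and the step size are all assumptions, not from the text) numerically checks that a function's Hessian is positive semidefinite and shows gradient descent reaching its global minimum:

```python
import numpy as np

# Convex quadratic f(x) = 1/2 x^T A x; its Hessian is A everywhere.
# If A is positive semidefinite, f has no saddle points and any local
# minimum is a global minimum. (Illustrative sketch, not from the text.)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# The Hessian is constant, so one check suffices: all eigenvalues >= 0.
eigvals = np.linalg.eigvalsh(A)
assert np.all(eigvals >= 0), "A is not PSD; f would not be convex"

# Gradient descent on f converges to the unique global minimum x = 0.
x = np.array([3.0, -2.0])
step = 0.1
for _ in range(500):
    x = x - step * A @ x          # gradient of f at x is A x
print(np.linalg.norm(x))          # close to 0
```

For a non-convex function, the same iteration could instead stall at a saddle point or a merely local minimum, which is why convexity buys such strong guarantees.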
4.4 Constrained Optimization
Sometimes we wish not to maximize or minimize a function f(x) over all
possible values of x, but instead to find the maximal or minimal value of f(x)
for values of x in some set S. This is known as constrained optimization.
Points x that lie within the set S are called feasible points in constrained
optimization terminology.
One simple approach to constrained optimization is to modify gradient
descent to take the constraint into account. If we use a small constant step size ε,
we can make gradient descent steps, then project the result back into S. If we use
a line search (see previous section), we can search only over step sizes that yield
new x points that are feasible, or we can project each point on the line back into
the constraint region. When possible, this method can be made more efficient by
projecting the gradient into the tangent space of the feasible region before taking
the step or beginning the line search (Rosen, 1960).
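The step-then-project scheme described above can be sketched as follows (an illustrative example of our own; the objective, the choice of S as the unit L^2 ball, and all names are assumptions, not from the text):

```python
import numpy as np

# Projected gradient descent: minimize f(x) = ||x - c||^2 subject to
# x in S = {x : ||x|| <= 1}. Each iteration takes an ordinary gradient
# step, then projects the result back into the feasible set S.
c = np.array([2.0, 2.0])           # unconstrained minimum lies outside S

def grad_f(x):
    return 2.0 * (x - c)

def project(x):
    """Project onto the unit L2 ball: rescale any point that lies outside."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

x = np.zeros(2)
step = 0.1
for _ in range(200):
    x = project(x - step * grad_f(x))

print(x)   # approaches the boundary point c/||c|| ≈ [0.707, 0.707]
```

Because the unconstrained minimum c lies outside S, the iterates settle on the closest feasible point, the projection of c onto the boundary of the ball. Projection is cheap for simple sets like balls and boxes; for more complicated feasible regions it can itself be an expensive optimization problem.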
A more sophisticated approach is to design a different, unconstrained opti-
mization problem whose solution can be converted into a solution to the original,
constrained optimization problem. For example, if we want to minimize f(x) for
x ∈ R^2 with x constrained to have exactly unit L^2 norm, we can instead
minimize g(θ) = f([cos θ, sin θ]^T) with respect to θ, then return [cos θ, sin θ]
as the solution to the original problem. This approach requires creativity; the
transformation