
CHAPTER 4. NUMERICAL COMPUTATION
Newton’s method that also use the Hessian matrix are called second-order opti-
mization algorithms (Nocedal and Wright, 2006).
The optimization algorithms employed in most contexts in this book are ap-
plicable to a wide variety of functions, but come with almost no guarantees. This
is because the family of functions used in deep learning is quite complicated. In
many other fields, the dominant approach to optimization is to design optimiza-
tion algorithms for a limited family of functions. Perhaps the most successful
field of specialized optimization is convex optimization. Convex optimization al-
gorithms are able to provide many more guarantees, but are applicable only to
functions for which the Hessian is positive semidefinite everywhere. Such functions
are well-behaved because they lack saddle points and all of their local minima are
necessarily global minima. However, most problems in deep learning are difficult
to express in terms of convex optimization. Convex optimization is used only as
a subroutine of some deep learning algorithms. Ideas from the analysis of convex
optimization algorithms can be useful for proving the convergence of deep learn-
ing algorithms. However, in general, the importance of convex optimization is
greatly diminished in the context of deep learning. For more information about
convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
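To make the guarantees of convexity concrete, the following sketch (our own illustration; the quadratic f, the matrix A, the starting point, and the step size are all assumptions, not from the text) numerically checks that a function's Hessian is positive semidefinite and shows gradient descent reaching its global minimum:

```python
import numpy as np

# Convex quadratic f(x) = 1/2 x^T A x; its Hessian is A everywhere.
# If A is positive semidefinite, f has no saddle points and any local
# minimum is a global minimum. (Illustrative sketch, not from the text.)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# The Hessian is constant, so one check suffices: all eigenvalues >= 0.
eigvals = np.linalg.eigvalsh(A)
assert np.all(eigvals >= 0), "A is not PSD; f would not be convex"

# Gradient descent on f converges to the unique global minimum x = 0.
x = np.array([3.0, -2.0])
step = 0.1
for _ in range(500):
    x = x - step * A @ x          # gradient of f at x is A x
print(np.linalg.norm(x))          # close to 0
```

For a non-convex function, the same iteration could instead stall at a saddle point or a merely local minimum, which is why convexity buys such strong guarantees.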
4.4 Constrained Optimization
Sometimes we wish not to maximize or minimize a function f(x) over all
possible values of x, but instead to find the maximal or minimal value of f(x)
for values of x in some set S. This is known as constrained optimization.
Points x that lie within the set S are called feasible points in constrained
optimization terminology.
One simple approach to constrained optimization is to modify gradient
descent to take the constraint into account. If we use a small constant step size ε,
we can make gradient descent steps, then project the result back into S. If we use
a line search (see previous section), we can search only over step sizes that yield
new x points that are feasible, or we can project each point on the line back into
the constraint region. When possible, this method can be made more efficient by
projecting the gradient into the tangent space of the feasible region before taking
the step or beginning the line search (Rosen, 1960).
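The step-then-project scheme described above can be sketched as follows (an illustrative example of our own; the objective, the choice of S as the unit L^2 ball, and all names are assumptions, not from the text):

```python
import numpy as np

# Projected gradient descent: minimize f(x) = ||x - c||^2 subject to
# x in S = {x : ||x|| <= 1}. Each iteration takes an ordinary gradient
# step, then projects the result back into the feasible set S.
c = np.array([2.0, 2.0])           # unconstrained minimum lies outside S

def grad_f(x):
    return 2.0 * (x - c)

def project(x):
    """Project onto the unit L2 ball: rescale any point that lies outside."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

x = np.zeros(2)
step = 0.1
for _ in range(200):
    x = project(x - step * grad_f(x))

print(x)   # approaches the boundary point c/||c|| ≈ [0.707, 0.707]
```

Because the unconstrained minimum c lies outside S, the iterates settle on the closest feasible point, the projection of c onto the boundary of the ball. Projection is cheap for simple sets like balls and boxes; for more complicated feasible regions it can itself be an expensive optimization problem.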
A more sophisticated approach is to design a different, unconstrained opti-
mization problem whose solution can be converted into a solution to the original,
constrained optimization problem. For example, if we want to minimize f(x) for
x ∈ R^2 with x constrained to have exactly unit L^2 norm, we can instead
minimize g(θ) = f([cos θ, sin θ]^T) with respect to θ, then return [cos θ, sin θ]
as the solution to the original problem. This approach requires creativity; the
transformation