
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
the risk. If we knew the true distribution p(x, y), minimizing the risk would be a pure optimization task, solvable by a standard optimization algorithm. However, when we do not know
p(x, y) but only have a training set of samples from it, we have a machine learning
problem.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk
\[
\mathbb{E}_{x, y \sim \hat{p}(x, y)} \left[ L(x, y) \right]
  = \frac{1}{m} \sum_{i=1}^{m} L\!\left( x^{(i)}, y^{(i)} \right)
\]
where m is the number of training examples.
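The averaging in this equation is straightforward to express in code. The following is a minimal NumPy sketch of empirical risk minimization's objective; the linear model, squared-error loss, and synthetic data are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def loss(theta, x, y):
    # Hypothetical per-example loss: squared error of a linear model.
    return (x @ theta - y) ** 2

def empirical_risk(theta, X, Y):
    # The empirical risk is the average of the per-example losses
    # over the m training examples: (1/m) * sum_i L(x^(i), y^(i)).
    m = len(X)
    return sum(loss(theta, X[i], Y[i]) for i in range(m)) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta          # noiseless labels for illustration

print(empirical_risk(np.zeros(3), X, Y))   # high risk far from true_theta
print(empirical_risk(true_theta, X, Y))    # zero risk at the minimizer
```

Because the labels here are generated without noise, the empirical risk reaches exactly zero at the true parameters; with noisy labels it would only approach the irreducible error.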
This process is known as empirical risk minimization. In this setting, machine
learning is still very similar to straightforward optimization. Rather than opti-
mizing the risk directly, we optimize the empirical risk, and hope that the risk
decreases significantly as well. A variety of theoretical results establish conditions
under which the true risk can be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such as the 0-1 loss (equal to 0 when an example is classified correctly and 1 otherwise), have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
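The failure of the 0-1 loss to provide a usable gradient, and the fix via a differentiable surrogate, can be illustrated numerically. This is a small sketch, assuming labels y in {-1, +1}, a scalar score, and the log loss as the surrogate; none of these specifics come from the text.

```python
import numpy as np

def zero_one_loss(score, y):
    # 1 if the sign of the score disagrees with the label y in {-1, +1}.
    return float(np.sign(score) != y)

def log_loss(score, y):
    # Differentiable surrogate for the 0-1 loss: log(1 + exp(-y * score)).
    return np.log1p(np.exp(-y * score))

def numerical_grad(f, s, eps=1e-6):
    # Central finite difference approximation of df/ds.
    return (f(s + eps) - f(s - eps)) / (2 * eps)

y, s = 1.0, -0.5   # a misclassified example (score has the wrong sign)
g01 = numerical_grad(lambda s: zero_one_loss(s, y), s)
gsur = numerical_grad(lambda s: log_loss(s, y), s)
print(g01, gsur)
```

Away from the decision boundary the 0-1 loss is piecewise constant, so its numerical gradient is exactly zero and gradient descent receives no signal; the surrogate's gradient is nonzero and negative here, pushing the score toward the correct sign.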
8.1.2 Surrogate Loss Functions
TODO — coordinate with Yoshua and with the MLP / ML chapters: do we use the term "loss function" to mean a map from a single example to a real number, or do we use it interchangeably with "objective function" / "cost function"? Some literature uses "loss function" in a very general sense, while other literature uses it to mean specifically a single-example cost whose expectation one can take. This terminology seems sub-optimal, since it relies on English words with essentially the same meaning to represent distinct, precise technical concepts. Also: do "surrogate loss functions" specifically replace the cost for an individual example, or does the term also cover things like minimizing the empirical risk rather than the true risk, adding a regularization term to the likelihood, etc.?