where m is the number of training examples.
This process is known as empirical risk minimization. In this setting, machine learn-
ing is still very similar to straightforward optimization. Rather than optimizing the risk
directly, we optimize the empirical risk, and hope that the risk decreases significantly as
well. A variety of theoretical results establish conditions under which the true risk can
be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting: models with high capacity
can simply memorize the training set. Moreover, in many cases empirical risk minimization
is not actually feasible. The most effective modern optimization algorithms are based on
gradient descent, but many useful loss functions, such as the 0-1 loss (which is 0 when an
example is classified correctly and 1 otherwise), have no useful derivatives: the derivative
is either zero or undefined everywhere. These two problems mean that, in the context of
deep learning, we rarely use empirical risk minimization. Instead, we use a slightly
different approach, in which the quantity that we actually optimize differs even more
from the quantity that we truly want to optimize.
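To make the derivative problem concrete, here is a small sketch using an invented 1-D linear classifier (the data and classifier are hypothetical, chosen only for illustration). A finite-difference estimate of the derivative of the empirical 0-1 risk is zero at almost every parameter value, so gradient descent receives no signal:

```python
import numpy as np

# Hypothetical 1-D example: a linear classifier that predicts sign(w * x).
x = np.array([-2.0, -1.0, 1.0, 3.0])
y = np.array([-1, -1, 1, 1])

def zero_one_risk(w):
    # Empirical 0-1 risk: the fraction of misclassified training examples.
    return np.mean(np.sign(w * x) != y)

# A finite-difference estimate of the derivative of the empirical 0-1 risk
# is zero here: a small change in w does not flip any prediction.
eps = 1e-6
for w in [0.5, 1.0, 2.0]:
    grad = (zero_one_risk(w + eps) - zero_one_risk(w - eps)) / (2 * eps)
    print(w, zero_one_risk(w), grad)
```

The risk only changes at the isolated values of w where a prediction flips sign, and at those points the derivative is undefined.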
8.1.2 Surrogate Loss Functions
Sometimes, the loss function we actually care about (say, classification error) is not
one that can be optimized efficiently. In such situations, we typically optimize a
surrogate loss function instead, which acts as a proxy for the true loss but has
advantages such as differentiability. For example, the negative log-likelihood of the
correct class is typically used as a surrogate for the 0-1 loss. The negative
log-likelihood allows the model to estimate the conditional probability of the classes
given the input, and if the model can do that well, then it can pick the classes that
yield the least classification error in expectation.
In some cases, a surrogate loss function actually results in being able to learn more.
For example, the test set 0-1 loss often continues to decrease for a long time after the
training set 0-1 loss has reached zero, when training using the log-likelihood surrogate.
This is because even when the expected 0-1 loss on the training set is zero, one can
improve the robustness of the classifier by pushing the classes further apart from each
other, obtaining a more confident and reliable classifier, thus extracting more
information from the training data than would have been possible by simply minimizing
the average 0-1 loss on the training set.
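This effect can be sketched with a small hypothetical experiment (the toy data and hyperparameters below are invented for illustration): logistic regression trained by gradient descent on linearly separable data drives the training 0-1 loss to zero almost immediately, yet the negative log-likelihood surrogate keeps decreasing afterward as the margin grows.

```python
import numpy as np

# Hypothetical toy data: two linearly separable classes on the real line.
x = np.concatenate([np.linspace(-3.0, -1.0, 50), np.linspace(1.0, 3.0, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])  # class labels in {0, 1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = 0.0, 0.0, 0.1
history = []  # (negative log-likelihood, training 0-1 loss) per step
for step in range(2000):
    p = sigmoid(w * x + b)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    err = np.mean((p > 0.5) != y)  # training 0-1 loss
    history.append((nll, err))
    # Gradient step on the surrogate; the 0-1 loss offers no such signal.
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

first_fit = next(i for i, (_, e) in enumerate(history) if e == 0.0)
print("0-1 loss hits zero at step", first_fit)
print("surrogate improvement after that:", history[first_fit][0] - history[-1][0])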
8.1.3 Generalization
When each minibatch of examples is drawn fresh from the data-generating distribution,
rather than resampled from a fixed training set, a stochastic gradient method (introduced
later in this chapter) optimizes the generalization error directly, because each gradient
estimate is then an unbiased sample of the gradient of the true risk.
A very important difference between optimization in general and optimization as we
use it for training algorithms is that training algorithms do not usually halt at a local
minimum. Instead, using a regularization method known as early stopping (see Sec.
7.8), they halt whenever overfitting begins to occur. This is often in the middle of a
wide, flat region, but it can also occur on a steep part of the surrogate loss function.
This is in contrast to general optimization, where convergence is usually defined by
arriving at a point that is very near a (local) minimum.
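The early stopping criterion described above can be sketched as follows (a minimal, hypothetical training loop; the function names `step_fn` and `val_loss_fn` are invented stand-ins for one optimization step and a validation-set evaluation):

```python
# Minimal early-stopping sketch: halt when the validation loss has not
# improved for `patience` consecutive evaluations, rather than when the
# training loss reaches a (local) minimum.

def train_with_early_stopping(step_fn, val_loss_fn, patience=5, max_steps=10000):
    """step_fn() performs one optimization step; val_loss_fn() returns the
    current validation loss. Returns the best step index and best loss."""
    best_loss = float("inf")
    best_step = 0
    bad_evals = 0
    for step in range(max_steps):
        step_fn()
        loss = val_loss_fn()
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # validation loss stopped improving: overfitting
    return best_step, best_loss

# Toy illustration: a validation loss that improves, then degrades.
t = {"i": 0}
def step_fn(): t["i"] += 1
def val_loss_fn(): return (t["i"] - 50) ** 2  # minimum at step 50
print(train_with_early_stopping(step_fn, val_loss_fn, patience=5))  # → (49, 0)
```

Note that the loop halts wherever the validation loss stops improving, regardless of whether the training objective is near a minimum at that point.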