
CHAPTER 7. REGULARIZATION
in GPU memory, but storing the optimal parameters in host memory or on a
disk drive). Since the best parameters are written to infrequently and never read
during training, these occasional slow writes have little effect on the total
training time.
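This bookkeeping can be sketched as follows; `train_step`, `validation_loss`, and the patience-based stopping rule are hypothetical stand-ins for whatever training procedure and stopping criterion are actually in use:

```python
import copy

def train_with_early_stopping(params, train_step, validation_loss,
                              patience=5, max_steps=1000):
    """Early stopping sketch: keep a copy of the best parameters seen so far.

    The copy is written only when validation loss improves, so the
    (possibly slow) transfer out of accelerator memory is infrequent.
    """
    best_params = copy.deepcopy(params)
    best_loss = float("inf")
    best_step = 0
    steps_since_improvement = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        loss = validation_loss(params)
        if loss < best_loss:
            best_loss = loss
            best_params = copy.deepcopy(params)  # occasional slow write
            best_step = step
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1
            if steps_since_improvement >= patience:
                break
    return best_params, best_step
```

Note that the training loop itself never reads `best_params`; it is returned only once training halts.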
Early stopping is a very unobtrusive form of regularization, in that it requires
almost no change to the underlying training procedure, the objective function, or the set of
allowable parameter values. This means that it is easy to use early stopping with-
out damaging the learning dynamics. This is in contrast to weight decay, where
one must be careful not to use too much weight decay and trap the network in a
bad local minimum corresponding to a solution with pathologically small weights.
Early stopping may be used either alone or in conjunction with other regu-
larization strategies. Even when using regularization strategies that modify the
objective function to encourage better generalization, it is rare for the best gen-
eralization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all of the training data is included. There are two basic strategies
one can use for this second training procedure.
One strategy is to initialize the model again and retrain on all of the data.
In this second training pass, we train for the same number of steps as the early
stopping procedure determined was optimal in the first pass. There are some
subtleties associated with this procedure. For example, there is not a good way
of knowing whether to retrain for the same number of parameter updates or the
same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger. Usually, if overfitting is a serious concern, you will want to
retrain for the same number of epochs, rather than the same number of parameter
updates. If the primary difficulty is optimization rather than generalization, then
retraining for the same number of parameter updates makes more sense (but it’s
also less likely that you need to use a regularization method like early stopping
in the first place). This algorithm is described more formally in Alg. 7.2.
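This first strategy can be sketched as below; `init_model`, `train_epoch`, and the epoch count `best_epochs` determined by the first early-stopping pass are hypothetical names, not part of the algorithm as stated in the text:

```python
def retrain_on_all_data(init_model, train_epoch, train_data, val_data,
                        best_epochs):
    """Strategy one: reinitialize the model and retrain on the union of
    training and validation data for the number of epochs that early
    stopping found to be optimal in the first pass."""
    all_data = train_data + val_data  # fold the validation set back in
    params = init_model()             # fresh initialization
    for _ in range(best_epochs):      # same number of epochs as pass one
        params = train_epoch(params, all_data)
    return params
```

Because `all_data` is larger than `train_data`, each epoch here performs more parameter updates than an epoch of the first pass did, which is exactly the subtlety discussed above.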
Another strategy for using all of the data is to keep the parameters obtained
from the first round of training and then continue training but now using all of the
data. At this stage, we no longer have a guide for when to stop in terms of a
number of steps. Instead, we can monitor the loss function on the validation set,
and continue training until it falls below the value of the training set objective at
which the early stopping procedure halted. This strategy avoids the high cost of
retraining the model from scratch, but is not as well-behaved. For example, there
is no guarantee that the objective on the validation set will ever reach the
194