
are known to be learnable, but when we compose them, the resulting task is much
more difficult to optimize with a neural network (across a large variety of
architectures), and other methods such as SVMs, boosting and decision trees fail
as well. This is an
instance where the optimization difficulty was solved by introducing prior knowl-
edge in the form of hints, specifically hints about what the intermediate layer in
a deep net should be doing. We have already seen in Section 8.7.4 that a useful
strategy is to ask the hidden units to extract features that are useful to the super-
vised task at hand, with greedy supervised pre-training. In Section 16.1 we will
discuss an unsupervised version of this idea, where we ask the intermediate layers
to extract features that are good at explaining the variations in the input, without
reference to a specific supervised task. Another related line of work is the Fit-
Nets (Romero et al., 2015), where the middle layer of a 5-layer supervised teacher
network is used as a hint to be predicted by the middle layer of a much deeper
student network (11 to 19 layers). In that case, additional parameters are intro-
duced to regress the middle layer of the 5-layer teacher network from the middle
layer of the deeper student network. The lower layers of the student network
thus get two objectives: to help the outputs of the student network accomplish
their task, and to predict the intermediate layer of the teacher network. Although
a deeper network is usually more difficult to optimize, it can generalize better
(because it has to extract more abstract and non-linear features). Romero et al. (2015)
were motivated by the fact that a deep student network with a smaller number
of hidden units per layer can have far fewer parameters (and faster computation)
than a wider, shallower network, and yet achieve the same or better generalization,
thus allowing a trade-off between better generalization (with 3 times fewer
parameters) and faster test-time computation (up to 10-fold in the paper, using a
very thin and deep network with 35 times fewer parameters). Without the hints on
the hidden layer, the student network performed very poorly in the experiments,
on both the training and the test sets.
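
As a rough sketch of this hint-based training, the following PyTorch-style code
combines the student's supervised loss with a regression loss that asks the
student's middle layer, passed through the extra regressor, to match the frozen
teacher's middle layer. The layer sizes, the weighting coefficient lam, and the
use of a single combined objective are illustrative assumptions rather than the
exact procedure of Romero et al. (2015), who optimize the hint term in a
separate first stage before training the full student.

import torch
import torch.nn as nn

# Toy layer sizes chosen only for illustration; they are not the
# architectures used by Romero et al. (2015).
teacher_trunk = nn.Sequential(            # teacher up to its middle layer (pre-trained, frozen)
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU())
student_trunk = nn.Sequential(            # thinner but deeper student, up to its middle layer
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU())
student_head = nn.Linear(64, 10)          # remaining layers of the student
regressor = nn.Linear(64, 256)            # extra parameters predicting the teacher's middle layer

task_loss = nn.CrossEntropyLoss()
hint_loss = nn.MSELoss()

def student_objective(x, y, lam=1.0):     # lam (hint weight) is an assumed hyperparameter
    with torch.no_grad():
        teacher_hint = teacher_trunk(x)   # hint: the teacher's middle-layer activations
    student_hint = student_trunk(x)
    logits = student_head(student_hint)
    # The student's lower layers receive two objectives: the supervised task
    # and the regression toward the teacher's middle layer.
    return task_loss(logits, y) + lam * hint_loss(regressor(student_hint), teacher_hint)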
These drastic effects of initialization and hints to middle layers bring forth
the question of what is sometimes called global optimization (Horst et al., 2000),
the main subject of this section. The objective of global optimization methods is
to find better solutions than the minimizers reached by local descent, i.e., ideally
to find a global minimum of the objective function and not simply a local minimum.
If one could
restart a local optimization method from a very large number of initial conditions,
one could imagine that the global minimum could be found, but there are more
efficient approaches.
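
To make the restart strategy concrete, here is a minimal NumPy/SciPy sketch
that runs a local descent method from many random initial conditions on a
made-up one-dimensional non-convex objective and keeps the lowest minimum
found. The objective, the search interval and the number of restarts are
arbitrary choices for illustration.

import numpy as np
from scipy.optimize import minimize

def objective(w):
    # A toy non-convex function with many local minima; purely illustrative.
    return float(np.sin(3.0 * w[0]) + 0.1 * w[0] ** 2)

rng = np.random.default_rng(0)
best = None
for _ in range(100):                       # number of restarts is an arbitrary choice
    w0 = rng.uniform(-10.0, 10.0, size=1)  # random initial condition
    result = minimize(objective, w0)       # local descent from this starting point
    if best is None or result.fun < best.fun:
        best = result                      # keep the lowest minimum found so far

print(best.x, best.fun)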
Two fairly general approaches to global optimization are continuation meth-
ods (Wu, 1997), a deterministic approach, and simulated annealing (Kirkpatrick
et al., 1983), a stochastic approach. They both proceed from the intuition that
if we sufficiently blur a non-convex objective function (e.g. convolve it with a