
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
duced to regress the middle layer of the 5-layer teacher network from the middle
layer of the deeper student network. The lower layers of the student network
thus get two objectives: help the outputs of the student network accomplish their
task, as well as predict the intermediate layer of the teacher network. Although
a deeper network is usually more difficult to optimize, it can generalize better (it
has to extract these more abstract and non-linear features). Romero et al. (2015)
were motivated by the fact that a deep student network with a smaller number
of hidden units per layer can have far fewer parameters (and faster computation)
than a fatter, shallower network and yet achieve the same or better generalization,
thus allowing a trade-off between better generalization (with 3 times fewer
parameters) and faster test-time computation (up to 10-fold in the paper, using
a very thin and deep network with 35 times fewer parameters). Without the hints
on the hidden layer, the student network performed very poorly in the experiments,
on both the training and test sets.
These drastic effects of initialization and hints to middle layers bring forth
the question of what is sometimes called global optimization (Horst et al., 2000),
the main subject of this section. The objective of global optimization methods is
to find better solutions than local descent minimizers, i.e., ideally find a global
minimum of the objective function and not simply a local minimum. If one could
restart a local optimization method from a very large number of initial conditions,
one would expect the global minimum to eventually be found, but there are more
efficient approaches.
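As a concrete illustration of the restart strategy, the following sketch runs gradient descent from many random initial points and keeps the best minimizer found. The one-dimensional objective, learning rate, and number of restarts are all illustrative choices, not from the text:

```python
import numpy as np

def objective(x):
    # Toy non-convex objective: the sine term creates many local minima,
    # while the quadratic term keeps the global minimum at a finite value.
    return np.sin(5 * x) + 0.1 * x ** 2

def local_descent(x0, lr=0.01, steps=500):
    # Plain gradient descent using a finite-difference gradient estimate.
    x, eps = x0, 1e-5
    for _ in range(steps):
        grad = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        x = x - lr * grad
    return x

# Restart local descent from many random initial conditions and keep the
# best local minimum found; with enough restarts, some initial point falls
# in the basin of attraction of the global minimum.
rng = np.random.default_rng(0)
starts = rng.uniform(-5.0, 5.0, size=100)
best = min((local_descent(x0) for x0 in starts), key=objective)
```

With enough restarts this finds the global minimum (near x ≈ −0.31 for this toy objective), but the cost grows linearly with the number of restarts, which is why more efficient approaches are of interest.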
Two fairly general approaches to global optimization are continuation meth-
ods (Wu, 1997), a deterministic approach, and simulated annealing (Kirkpatrick
et al., 1983), a stochastic approach. They both proceed from the intuition that
if we sufficiently blur a non-convex objective function (e.g. convolve it with a
Gaussian) whose global minima are not at infinite values, then it becomes con-
vex, and finding the global optimum of that blurred objective function should be
much easier. As illustrated in Figure 8.5, by gradually changing the objective
function from a very blurred easy to optimize version to the original crisp and
difficult objective function, we are actually likely to find better local minima. In
the case of simulated annealing, the blurring occurs because of the injected noise.
With injected noise, the state of the system can sometimes go uphill, and thus
does not necessarily get stuck in a local minimum. With a lot of noise, the ef-
fective objective function (averaged over the noise) is flatter and convex, and if
the amount of noise is reduced sufficiently slowly, then one can show convergence
to the global minimum. However, the annealing schedule (the rate at which the
noise level is decreased or, in the physical annealing analogy, the rate at which
the temperature is lowered) might need to be extremely slow, so an NP-hard
optimization problem remains NP-hard.
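To make the procedure concrete, here is a minimal sketch of simulated annealing on the same kind of one-dimensional toy objective (the objective, proposal width, and geometric cooling schedule are all illustrative assumptions, not from the text):

```python
import math
import random

def objective(x):
    # Toy non-convex objective with many local minima.
    return math.sin(5 * x) + 0.1 * x ** 2

random.seed(0)
x = best = 4.0   # start far from the global minimum (near x = -0.31)
T = 2.0          # initial temperature, i.e., the noise level
while T > 1e-3:
    for _ in range(100):
        candidate = x + random.gauss(0.0, 0.5)  # random proposed move
        delta = objective(candidate) - objective(x)
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-delta / T), so the state can escape local
        # minima while the temperature is still high.
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = candidate
            if objective(x) < objective(best):
                best = x
    T *= 0.9  # annealing schedule: slow geometric cooling
```

Cooling more slowly increases the chance of ending in the global basin at the price of more iterations; a schedule slow enough to guarantee convergence is impractically slow, consistent with the NP-hardness remark above.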