
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
nonlinear activation functions. This initialization scheme is also motivated by a
model of a deep network as a sequence of matrix multiplies without nonlinearities.
Under such a model, this initialization scheme guarantees that the total number of
training iterations required to reach convergence is independent of depth.
Increasing the scaling factor $g$ pushes the network toward the regime where
activations increase in norm as they propagate forward through the network and
gradients increase in norm as they propagate backward. Sussillo (2014) showed
that setting the gain factor correctly is sufficient to train networks as deep as
1,000 layers, without needing to use orthogonal initializations. A key insight of
this approach is that in feedforward networks, activations and gradients can grow
or shrink on each step of forward or back-propagation, following a random walk
behavior. This is because feedforward networks use a different weight matrix at
each layer. If this random walk is tuned to preserve norms, then feedforward
networks can mostly avoid the vanishing and exploding gradients problem that
arises when the same weight matrix is used at each step, described in Sec. 8.2.5.
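
The effect of the gain on forward propagation can be illustrated with a small numerical sketch. It assumes the simplified model mentioned above, a deep network of matrix multiplies without nonlinearities, with a different Gaussian weight matrix at each layer whose entries have standard deviation $g/\sqrt{m}$. The function name `forward_norms` and the depth, width, and seed values are hypothetical choices made for illustration and do not come from Sussillo (2014).

```python
import numpy as np

def forward_norms(gain, depth=50, width=256, seed=0):
    """Return the activation norm after each layer of a deep linear network.

    Illustrative assumption: every layer has its own weight matrix with
    entries drawn from N(0, gain**2 / width), and there is no nonlinearity.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    h /= np.linalg.norm(h)  # start from a unit-norm input
    norms = []
    for _ in range(depth):
        # A fresh weight matrix at each layer, so the norm follows a random walk.
        W = rng.standard_normal((width, width)) * gain / np.sqrt(width)
        h = W @ h
        norms.append(np.linalg.norm(h))
    return np.array(norms)

# Compare a gain below, at, and above the norm-preserving value.
for g in (0.8, 1.0, 1.2):
    print(f"gain={g}: final activation norm = {forward_norms(g)[-1]:.3e}")
```

Under these assumptions, a gain of 1 keeps the norm within a modest factor of its starting value, while gains of 0.8 and 1.2 shrink and grow it roughly geometrically with depth. Gradients propagated backward through the transposed matrices behave analogously.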
Unfortunately, these optimal criteria for initial weights often do not lead to
optimal performance. This may be for three different reasons. First, we may
be using the wrong criteria—it may not actually be beneficial to preserve the
norm of a signal throughout the entire network. Second, the properties imposed
at initialization may not persist after learning has begun. Third, the
criteria might succeed at improving the speed of optimization but inadvertently
increase generalization error. In practice, we usually need to treat the scale of the
weights as a hyperparameter whose optimal value lies somewhere roughly near but
not exactly equal to the theoretical predictions.
One drawback to scaling rules that set all of the initial weights to have the same
standard deviation, such as $1/\sqrt{m}$, is that every individual weight becomes extremely
small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to
have exactly $k$ non-zero weights. The idea is to keep the total amount of input to
the unit independent from the number of inputs $m$ without making the magnitude
of individual weight elements shrink with $m$. Sparse initialization helps to achieve
more diversity among the units at initialization time. However, it also imposes
a very strong prior on the weights that are chosen to have large Gaussian values.
Because it takes a long time for gradient descent to shrink “incorrect” large values,
this initialization scheme can cause problems for units such as maxout units that
have several filters that must be carefully coordinated with each other.
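
The idea of sparse initialization can be sketched as follows. This is a minimal illustration, not the exact procedure of Martens (2010): the function name `sparse_init`, the `scale` and `seed` parameters, and the choice of a unit-variance Gaussian for the non-zero entries are assumptions made here for the example.

```python
import numpy as np

def sparse_init(m, n, k, scale=1.0, seed=0):
    """Return an (m, n) weight matrix in which each of the n units has
    exactly k non-zero incoming weights.

    Illustrative assumption: the non-zero weights are drawn from
    N(0, scale**2), so their magnitude does not depend on m.
    """
    if k > m:
        raise ValueError("k cannot exceed the number of inputs m")
    rng = np.random.default_rng(seed)
    W = np.zeros((m, n))
    for j in range(n):
        # Choose which k of the m inputs feed unit j, without repetition.
        idx = rng.choice(m, size=k, replace=False)
        W[idx, j] = rng.standard_normal(k) * scale
    return W

W = sparse_init(m=1000, n=20, k=15)
print(np.count_nonzero(W, axis=0))  # number of non-zero weights per unit
```

Because only $k$ entries per column are non-zero, the variance of a unit's total input scales with $k$ rather than with $m$, while each non-zero weight keeps a standard deviation of `scale` however large the layer becomes.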
When computational resources allow it, it is usually a good idea to treat the
initial scale of the weights for each layer as a hyperparameter, and to choose these