
One significant advantage of dropout is that it places few limits on the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent. This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Pascanu et al., 2014a). This is very different from many other neural network regularization strategies, such as those based on unsupervised pretraining or semi-supervised learning. Such regularization strategies often impose restrictions such as not being able to use rectified linear units or max pooling. Often these restrictions incur enough harm to outweigh the benefit provided by the regularization strategy.
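As a concrete illustration of this point, the following minimal sketch in Python/NumPy shows dropout applied to the hidden layer of a small feedforward network trained with minibatch stochastic gradient descent. The network size, synthetic data, dropout probability, and learning rate are all illustrative assumptions, not values taken from the text; the point is only that the sampled binary mask is the sole change relative to ordinary SGD, which is why dropout composes easily with different architectures and training procedures.

import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2, include_prob=0.5):
    """Forward pass with dropout applied to the hidden units: each unit is
    kept with probability include_prob and zeroed out otherwise."""
    h = np.maximum(0.0, x @ W1 + b1)                 # rectified linear hidden layer
    mask = (rng.random(h.shape) < include_prob).astype(h.dtype)
    h_drop = h * mask                                # a randomly selected sub-network
    y = h_drop @ W2 + b2                             # linear output
    return y, h, h_drop, mask

# Illustrative sizes, data, and hyperparameters (assumptions for this sketch).
n_in, n_hid, n_out = 10, 32, 1
W1 = 0.1 * rng.standard_normal((n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal((n_hid, n_out)); b2 = np.zeros(n_out)
X = rng.standard_normal((512, n_in))
Y = 2.0 * X[:, :1] + 0.1 * rng.standard_normal((512, 1))

lr = 0.05
for step in range(1000):
    idx = rng.integers(0, len(X), size=32)           # minibatch for SGD
    x, t = X[idx], Y[idx]
    y, h, h_drop, mask = forward(x, W1, b1, W2, b2)
    err = (y - t) / len(x)                           # grad of mean 0.5 * squared error
    # Backpropagation: the mask sampled in the forward pass gates the gradients,
    # so only the units present in this sub-network are updated on this step.
    gW2, gb2 = h_drop.T @ err, err.sum(0)
    dh = (err @ W2.T) * mask * (h > 0)
    gW1, gb1 = x.T @ dh, dh.sum(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g                                  # in-place SGD update

# At test time, one would use the weight scaling inference rule discussed
# below: run the full network with the hidden activations multiplied by
# include_prob instead of a sampled mask.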
Though the per-step cost of applying dropout to a specific model is negligible, the cost of using dropout in a complete system can be significant. This is because the model size that is optimal in terms of validation set error is usually much larger when dropout is used, and because the number of training steps required to reach convergence increases. This is of course to be expected from a regularization method, but it does mean that for very large datasets it is often preferable not to use dropout at all, simply to speed training and reduce the computational cost of the final model. As a rough rule of thumb, dropout is unlikely to be beneficial when more than 15 million training examples are available, though the exact boundary may be highly problem dependent.
When extremely few labeled training examples are available, dropout is less effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing Dataset (Xiong et al., 2011), where fewer than 5,000 examples are available (Srivastava et al., 2014). When additional unlabeled data is available, unsupervised feature learning can gain an advantage over dropout.
The stochasticity used while training with dropout is not a necessary part of the approach's success. It is just a means of approximating the sum over all sub-models. Wang and Manning (2013) derived analytical approximations to this marginalization. Their approximation, known as fast dropout, resulted in faster convergence due to the reduced stochasticity in the computation of the gradient. This method can also be applied at test time, as a more principled (but also more computationally expensive) approximation to the average over
all sub-networks than the weight scaling approximation. Fast dropout has been used to nearly match the performance of standard dropout on small neural network problems, but has not yet yielded a significant improvement or been applied to a large problem.
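To make the two inference strategies concrete, the sketch below compares the weight scaling approximation with a Monte Carlo estimate of the average over sampled sub-networks. The single-hidden-layer classifier and its random weights are assumptions for illustration only, and the sketch does not implement the analytical fast dropout approximation itself; it only shows the sampling-based average that fast dropout replaces with a deterministic computation.

import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative single-hidden-layer classifier with fixed random weights
# (assumptions for this sketch, not a trained model).
n_in, n_hid, n_cls = 8, 64, 3
W1 = rng.standard_normal((n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_hid, n_cls)); b2 = np.zeros(n_cls)
x = rng.standard_normal((1, n_in))
include_prob = 0.5

def predict(x, scale):
    """Forward pass with the hidden units multiplied elementwise by `scale`,
    which is either a sampled 0/1 mask (one sub-network) or the constant
    inclusion probability (weight scaling)."""
    h = np.maximum(0.0, x @ W1 + b1) * scale
    return softmax(h @ W2 + b2)

# Weight scaling approximation: a single pass through the full network,
# with each hidden unit scaled by its probability of being included.
p_weight_scaling = predict(x, include_prob)

# Monte Carlo approximation of the average over sub-networks: sample many
# masks and average the resulting predictive distributions.
n_samples = 10_000
masks = rng.random((n_samples, n_hid)) < include_prob
p_monte_carlo = np.mean([predict(x, m) for m in masks], axis=0)

print("weight scaling:", np.round(p_weight_scaling, 3))
print("Monte Carlo   :", np.round(p_monte_carlo, 3))

The two estimates are generally close for a model like this; fast dropout replaces the sampling loop with an analytical approximation to the same marginalization, which is also what removes the stochasticity from training.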