MNIST dataset (not using any prior knowledge about images) is attained by a clas-
sifier that uses both dropout regularization and deep Boltzmann machine pretraining.
However, combining dropout with unsupervised pretraining has not become a popular
strategy for larger models and more challenging datasets.
One advantage of dropout is that it is very computationally cheap. Using dropout
during training requires only O(n) computation per example per update, to generate
n random binary numbers and multiply them by the state. Depending on the imple-
mentation, it may also require O(n) memory to store these binary numbers until the
backpropagation stage. Running inference in the trained model has the same cost per-
example as if dropout were not used, though we must pay the cost of dividing the weights
by 2 once before beginning to run inference on examples.
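As a rough illustration of these costs, the sketch below (a minimal NumPy example; the layer sizes, the rectified linear activations, and the 0.5 inclusion probability are assumptions, not taken from any particular implementation) shows the mask-and-multiply step added during training and the one-time weight rescaling used at inference time.

import numpy as np

rng = np.random.default_rng(0)

# One hidden layer with assumed sizes; hidden units are kept with probability 0.5.
n_in, n_hid, n_out = 784, 256, 10
W = rng.standard_normal((n_in, n_hid)) * 0.01   # input -> hidden weights
V = rng.standard_normal((n_hid, n_out)) * 0.01  # hidden -> output weights
p_include = 0.5

def train_forward(x):
    # O(n) extra work per example: sample n random binary numbers and multiply
    # them into the hidden state. The mask is returned so it can be stored for
    # backpropagation, which costs O(n) memory.
    h = np.maximum(0.0, x @ W)                   # rectified linear hidden units
    mask = (rng.random(h.shape) < p_include).astype(h.dtype)
    return (h * mask) @ V, mask

def test_forward(x):
    # Inference has the same per-example cost as a network trained without
    # dropout, once the outgoing weights have been divided by 2 (p_include = 0.5).
    h = np.maximum(0.0, x @ W)
    return h @ (V * p_include)

x = rng.standard_normal(n_in)                    # placeholder input
logits_train, mask = train_forward(x)
logits_test = test_forward(x)
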
Another significant advantage of dropout is that it does not significantly limit the type
of model or training procedure that can be used. It works well with nearly any model
that uses a distributed representation and can be trained with stochastic gradient de-
scent. This includes feedforward neural networks, probabilistic models such as restricted
Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Pascanu
et al., 2014a). This is very different from many other neural network regularization
strategies, such as those based on unsupervised pretraining or semi-supervised learning.
Such regularization strategies often impose restrictions such as not being able to use
rectified linear units or max pooling. Often these restrictions incur enough harm to
outweigh the benefit provided by the regularization strategy.
Though the cost per step of applying dropout to a specific model is negligible, the
cost of using dropout in a complete system can be significant. This is because the size
of the optimal model (in terms of validation set error) is usually much larger when
dropout is used, and because the number of steps required to reach convergence increases.
This is of course to be expected from a regularization method, but it does mean that
for very large datasets it is often preferable not to use dropout at all, simply to speed
training and reduce the computational cost of the final model. As a rough rule of thumb,
dropout is unlikely to be beneficial when more than 15 million training examples are
available, though the exact boundary may be highly problem dependent.
When extremely few labeled training examples are available, dropout is less effective.
Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing
Dataset (Xiong et al., 2011), where fewer than 5,000 examples are available (Srivastava
et al., 2014). When additional unlabeled data is available, unsupervised feature learning
can gain an advantage over dropout.
TODO: "Dropout Training as Adaptive Regularization"? (Wager et al., 2013)
TODO: perspective as L2 regularization
TODO: connection to AdaGrad?
TODO: semi-supervised variant
TODO: Baldi paper (Baldi and Sadowski, 2013)
TODO: DWF paper (Warde-Farley et al., 2014)
TODO: using the geometric mean is not a problem
TODO: dropout boosting, it's not just noise robustness
TODO: what was the conclusion about mixability?
The stochasticity used while training with dropout is not necessary for the approach's
success. It is just a means of approximating the sum over all sub-models.
Wang and Manning (2013) derived analytical approximations to this marginalization.
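The sketch below (an illustrative NumPy example with assumed sizes and random weights, not Wang and Manning's analytical derivation) makes this point concrete: averaging the predictions of many randomly sampled sub-models approaches a stable value that a single deterministic pass with the weight scaling rule approximates closely, even though the two quantities need not agree exactly for a nonlinear network.

import numpy as np

rng = np.random.default_rng(0)

# A tiny randomly initialized network; sizes and weights are placeholders.
n_in, n_hid, n_out = 20, 50, 3
W = rng.standard_normal((n_in, n_hid)) * 0.1
V = rng.standard_normal((n_hid, n_out)) * 0.1
p_include = 0.5
x = rng.standard_normal(n_in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def submodel_predict(mask):
    # Prediction of one sub-model, defined by a particular binary mask on the
    # hidden units.
    h = np.maximum(0.0, x @ W) * mask
    return softmax(h @ V)

# Monte Carlo approximation of the average over all 2^n_hid sub-models.
masks = (rng.random((10000, n_hid)) < p_include).astype(float)
mc_average = np.mean([submodel_predict(m) for m in masks], axis=0)

# Deterministic approximation: one pass with the weights divided by 2.
weight_scaled = softmax(np.maximum(0.0, x @ W) @ (V * p_include))

print(mc_average)     # close to weight_scaled, with no sampling at test time
print(weight_scaled)
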