Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
December 5, 2014
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 27
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 31
3 Probability and Information Theory 35
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 37
3.3.2 Continuous variables and probability density functions . . . . . . 38
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 40
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maxi-
mum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 90
5.10 Weakly supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 95
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 99
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 102
6 Feedforward Deep Networks 104
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 104
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 107
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 108
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 113
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 115
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 118
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 122
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 124
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 125
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Regularization 127
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 128
7.1.1 L^2 parameter regularization . . . . . . . . . . . . . . . . . . . . 129
7.1.2 L^1 regularization . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1.3 L^∞ regularization . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Classical regularization as constrained optimization . . . . . . . . . . . . 132
7.3 Regularization from a Bayesian perspective . . . . . . . . . . . . . . . . 134
7.4 Early stopping as a form of regularization . . . . . . . . . . . . . . . . . 134
7.5 Regularization and under-constrained problems . . . . . . . . . . . . . . 139
7.6 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.8 Dataset augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.9 Classical regularization as noise robustness . . . . . . . . . . . . . . . . 141
7.10 Semi-supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11.1 The pretraining protocol . . . . . . . . . . . . . . . . . . . . . . 142
7.12 Bagging and other ensemble methods . . . . . . . . . . . . . . . . . . . . 144
7.13 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.14 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8 Optimization for training deep models 150
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.2 Plateaus, saddle points, and other flat regions . . . . . . . . . . . 150
8.1.3 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 150
8.1.4 Vanishing and Exploding Gradients - An Introduction to the Issue
of Learning Long-Term Dependencies . . . . . . . . . . . . . . . 153
8.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.1 Approximate Natural Gradient and Second-Order Methods . . . 156
8.2.2 Optimization strategies and meta-algorithms . . . . . . . . . . . 156
8.2.3 Coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.4 Greedy supervised pre-training . . . . . . . . . . . . . . . . . . . 157
8.3 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 157
9 Structured Probabilistic Models: A Deep Learning Perspective 158
9.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 159
9.2 A Graphical Syntax for Describing Model Structure . . . . . . . . . . . 161
9.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 167
9.2.6 Operations on a Graph . . . . . . . . . . . . . . . . . . . . . . . 169
9.2.7 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 171
9.4 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 173
9.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 174
9.5 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 174
9.6 Inference and Approximate Inference Over Latent Variables . . . . . . . 174
9.7 The Deep Learning Approach to Structured Probabilistic Modeling . . . 176
9.7.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 177
10 Unsupervised and Transfer Learning 179
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 181
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 184
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 185
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 186
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 188
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 191
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 192
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 193
11 Convolutional Networks 199
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 209
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 216
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12 Sequence Modeling: Recurrent and Recursive Nets 217
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 217
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 219
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 221
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 223
12.2.3 RNNs to represent conditional probability distributions . . . . . 225
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 231
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 232
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 235
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 237
12.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . . . 238
12.6.4 The Long Short-Term Memory Architecture and Other Gated RNNs . . . . 239
12.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 245
12.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 245
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other
graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.2 Efficient Marginalization and Inference for Temporally Structured
Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 251
12.8.1 Joint Training of Neural Networks and Sequential Probabilistic
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.8.2 MAP and Structured Output Models . . . . . . . . . . . . . . . . 251
12.8.3 Back-prop through Search . . . . . . . . . . . . . . . . . . . . . . 251
12.9 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13 The Manifold Perspective on Auto-Encoders 252
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 261
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 263
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 266
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 267
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 269
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 271
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 274
14 Distributed Representations: Disentangling the Underlying Factors 275
14.1 Assumption of Underlying Factors . . . . . . . . . . . . . . . . . . . . . 275
14.2 Exponential Gain in Representational Efficiency from Distributed Repre-
sentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
14.3 Exponential Gain in Representational Efficiency from Depth . . . . . . . 275
14.4 Additional Priors Regarding The Underlying Factors . . . . . . . . . . . 275
15 Confronting the Partition Function 276
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 276
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 278
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 282
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 291
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 293
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 293
16 Approximate inference 296
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 296
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 299
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 300
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 302
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 302
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 304
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 304
17 Deep generative models 305
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 305
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 308
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 308
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 309
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 310
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 312
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 312
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 312
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 312
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 313
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 313
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 314
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 314
17.7.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . 314
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 315
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 315
17.10 Methodological notes . . . . . . . . . . . . . . . . . . . . . . . . . 315
18 Large scale deep learning 318
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 318
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
19 Practical methodology 320
19.1 When to gather more data, control capacity, or change algorithms . . . 320
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 320
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 320
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 320
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 322
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 322
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
19.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
20 Applications 323
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.3 Natural language processing and neural language models . . . . . . . . . 329
20.3.1 Neural language models . . . . . . . . . . . . . . . . . . . . . . . 329
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Bibliography 330
Index 348
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger
and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Pawel Chilinski.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao, John Philip Anderson.
Numerical computation: Meire Fortunato, Jurgen Van Gael, Dustin Webb.
Machine learning: Dzmitry Bahdanau, Kelvin Xu.
MLPs: Jurgen Van Gael.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Unsupervised learning: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).