Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
January 1, 2015
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 28
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 32
3 Probability and Information Theory 35
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 37
3.3.2 Continuous variables and probability density functions . . . . . . 38
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 40
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maximum Likelihood . . . . . . . 89
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 90
5.10 Weakly supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 93
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 97
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 100
6 Feedforward Deep Networks 102
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 102
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 105
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 106
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 111
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 113
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 116
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 120
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 122
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 124
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7 Regularization 126
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 127
7.1.1 L2 Parameter Regularization . . . . . . . . . . . . . . . . . . . . 128
7.1.2 L1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1.3 L∞ Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Classical Regularization as Constrained Optimization . . . . . . . . . . 132
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . . . . . 134
7.4 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 134
7.5 Regularization and Under-Constrained Problems . . . . . . . . . . . . . 139
7.6 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.8 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.9 Classical Regularization as Noise Robustness . . . . . . . . . . . . . . . 141
7.10 Semi-Supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.11 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11.1 Pretraining Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.12 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . . . . . 144
7.13 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.14 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8 Optimization for training deep models 150
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.2 Plateaus, saddle points, and other flat regions . . . . . . . . . . . 150
8.1.3 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 150
8.1.4 Vanishing and Exploding Gradients - An Introduction to the Issue of Learning Long-Term Dependencies . . . . . . . 153
8.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.1 Approximate Natural Gradient and Second-Order Methods . . . 156
8.2.2 Optimization strategies and meta-algorithms . . . . . . . . . . . 156
8.2.3 Coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.4 Greedy supervised pre-training . . . . . . . . . . . . . . . . . . . 157
8.3 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 157
9 Structured Probabilistic Models: A Deep Learning Perspective 158
9.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 159
9.2 A Graphical Syntax for Describing Model Structure . . . . . . . . . . . 161
9.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 167
9.2.6 Operations on a Graph . . . . . . . . . . . . . . . . . . . . . . . 169
9.2.7 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 171
9.4 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 171
9.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 171
9.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 172
9.5 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 173
9.6 Inference and Approximate Inference Over Latent Variables . . . . . . . 174
9.7 The Deep Learning Approach to Structured Probabilistic Modeling . . . 176
9.7.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 177
10 Unsupervised and Transfer Learning 179
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 181
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 184
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 185
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 186
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 188
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 191
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 192
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 193
11 Convolutional Networks 199
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 209
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 216
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12 Sequence Modeling: Recurrent and Recursive Nets 217
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 217
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 219
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 221
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 223
12.2.3 RNNs to represent conditional probability distributions . . . . . 225
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 231
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 232
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 235
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely Contractive . . . . . . . 235
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 237
12.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . . . 238
12.6.4 The Long-Short-Term-Memory Architecture and Other Gated RNNs . . 239
12.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 245
12.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 245
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other graphical models . . . . . . . 246
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.2 Efficient Marginalization and Inference for Temporally Structured Outputs by Dynamic Programming . . . . . . . 247
12.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
12.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 256
12.8.1 Approximate Search . . . . . . . . . . . . . . . . . . . . . . . . . 257
13 The Manifold Perspective on Auto-Encoders 261
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 269
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 272
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 274
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 276
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 279
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 281
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 284
13.6 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier . . . 285
14 Distributed Representations: Disentangling the Underlying Factors 288
14.1 Causality and Semi-Supervised Learning . . . . . . . . . . . . . . . . . . 288
14.2 Assumption of Underlying Factors and Distributed Representation . . . 290
14.3 Exponential Gain in Representational Efficiency from Distributed Representations . . . . . . . 294
14.4 Exponential Gain in Representational Efficiency from Depth . . . . . . . 295
14.5 Priors Regarding The Underlying Factors . . . . . . . . . . . . . . . . . 298
15 Confronting the Partition Function 301
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 301
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 303
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 307
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 316
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 318
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 318
16 Approximate inference 321
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 321
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 323
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 324
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 325
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 327
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 327
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 329
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 329
17 Deep generative models 330
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 330
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 333
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 334
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 335
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 337
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 337
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 337
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 337
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 338
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 338
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 339
17.7.2 Variational interpretation of PSD . . . . . . . . . . . . . . . . . . 339
17.7.3 Generative adversarial networks . . . . . . . . . . . . . . . . . . 339
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 340
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 340
17.10 Methodological notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
18 Large scale deep learning 343
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 343
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
19 Practical methodology 345
19.1 When to gather more data, control capacity, or change algorithms . . . 345
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 345
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 345
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 345
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 347
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 347
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow Graphs . . . . . . . 347
19.5.3 Momentum and Other Averaging Techniques as Cheap Second Order Methods . . . . . . . 347
20 Applications 348
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.3 Natural language processing and neural language models . . . . . . . . . 354
20.3.1 Neural language models . . . . . . . . . . . . . . . . . . . . . . . 354
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
Bibliography 355
Index 376
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Pawel Chilinski.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao, John Philip Anderson.
Numerical: Meire Fortunato, Jurgen Van Gael, Dustin Webb.
ML: Dzmitry Bahdanau, Kelvin Xu.
MLPs: Jurgen Van Gael.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Unsupervised: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).