Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
October 21, 2014
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 27
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 31
3 Probability and Information Theory 34
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 36
3.3.2 Continuous variables and probability density functions . . . . . . 37
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 39
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 40
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 43
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 43
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 43
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.10.5 Mixtures of Distributions and Gaussian Mixtures . . . . . . . . . 47
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 47
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 50
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Numerical Computation 54
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Machine Learning Basics 67
5.1 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 69
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 73
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 75
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 78
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 81
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 82
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 82
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 84
5.8.2 Estimating Probabilities or Conditional Probabilities by Maxi-
mum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.10 The Smoothness Prior, Local Generalization and Non-Parametric Models 85
5.11 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 90
5.12 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 93
6 Feedforward Deep Networks 95
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 95
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 97
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 99
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 104
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 106
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 108
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 112
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 114
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7 Regularization 116
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 117
7.1.1 L^2 regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1.2 L^1 regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.1.3 L regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Regularization from a Bayesian perspective . . . . . . . . . . . . . . . . 119
7.3 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 119
7.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Sparsity of Representations . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 Semi-supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.7 Early stopping as a form of regularization . . . . . . . . . . . . . . . . . 120
7.8 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.8.1 The pretraining protocol . . . . . . . . . . . . . . . . . . . . . . . 121
7.9 Bagging and other ensemble methods . . . . . . . . . . . . . . . . . . . . 123
7.10 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.11 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8 Optimization for training deep models 129
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.1 Surrogate loss functions . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.2 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.4 Data parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.2 Local Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 Ill-Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.4 Plateaus, saddle points, and other flat regions . . . . . . . . . . . . . . . 130
8.5 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . . . 130
8.6 Vanishing and Exploding Gradients - An Introduction to the Issue of
Learning Long-Term Dependencies . . . . . . . . . . . . . . . . . . . . . 130
8.7 Inexact gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.8 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.8.1 Approximate Natural Gradient and Second-Order Methods . . . 130
8.9 Challenges in optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.9.1 Optimization strategies and meta-algorithms . . . . . . . . . . . 131
8.9.2 Initialization strategies . . . . . . . . . . . . . . . . . . . . . . . . 131
8.9.3 Greedy Supervised Pre-Training . . . . . . . . . . . . . . . . . . 131
8.9.4 Designing models to aid optimization . . . . . . . . . . . . . . . 131
8.10 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 131
9 Structured Probabilistic Models: A Deep Learning Perspective 132
9.1 The challenge of unstructured modeling . . . . . . . . . . . . . . . . . . 133
9.2 A graphical syntax for describing model structure . . . . . . . . . . . . . 135
9.2.1 Directed models . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.2.2 Undirected models . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 138
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 140
9.2.5 Separation and d-separation . . . . . . . . . . . . . . . . . . . . . 141
9.2.6 Operations on a graph . . . . . . . . . . . . . . . . . . . . . . . . 143
9.2.7 Factor graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.3 Advantages of structured modeling . . . . . . . . . . . . . . . . . . . . . 145
9.4 Learning about dependencies . . . . . . . . . . . . . . . . . . . . . . . . 147
9.4.1 Latent variables versus structure learning . . . . . . . . . . . . . 147
9.4.2 Latent variables for feature learning . . . . . . . . . . . . . . . . 148
9.5 The deep learning approach to structured probabilistic modeling . . . . 148
9.5.1 Example: The restricted Boltzmann machine . . . . . . . . . . . 148
9.6 Markov chain Monte Carlo methods . . . . . . . . . . . . . . . . . . . . 149
10 Unsupervised and Transfer Learning 151
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 153
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 156
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 157
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 158
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 160
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 163
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 164
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 165
11 Convolutional Networks 169
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 179
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 186
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
12 Sequence Modeling: Recurrent and Recursive Nets 187
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 187
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 189
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 191
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 193
12.2.3 RNNs to represent conditional probability distributions . . . . . 195
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 199
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 200
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 201
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 202
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 205
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 205
12.6.3 The Long Short-Term Memory Architecture . . . . . . . . . . . . 205
12.6.4 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.5 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.6 Regularizing to Encourage Information Flow . . . . . . . . . . . 205
12.6.7 Organizing the State at Multiple Time Scales . . . . . . . . . . . 205
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other
graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.2 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.3 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 206
12.8.1 Joint Training of Neural Networks and Sequential Probabilistic
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.8.2 MAP and Structured Output Models . . . . . . . . . . . . . . . . 206
12.8.3 Back-prop through Search . . . . . . . . . . . . . . . . . . . . . . 206
12.9 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13 The Manifold Perspective on Auto-Encoders 207
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 216
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 218
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 221
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 222
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 222
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 224
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 226
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 229
14 Distributed Representations: Disentangling the Underlying Factors 230
14.1 Assumption of Underlying Factors . . . . . . . . . . . . . . . . . . . . . 230
14.2 Exponential Gain in Representational Efficiency from Distributed Repre-
sentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
14.3 Exponential Gain in Representational Efficiency from Depth . . . . . . . 230
14.4 Additional Priors Regarding The Underlying Factors . . . . . . . . . . . 230
15 Confronting the Partition Function 231
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 231
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 233
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 237
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 246
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 248
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 248
16 Approximate inference 251
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 251
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 253
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 254
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 255
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 256
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 257
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 258
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 259
17 Deep generative models 260
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 260
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 263
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 263
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 264
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 264
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 266
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 266
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 266
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 266
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 268
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 268
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 269
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 269
17.7.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . 269
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 270
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 270
18 Large scale deep learning 271
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 271
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
19 Practical methodology 273
19.1 When to gather more data, control capacity, or change algorithms . . . 273
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 273
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 273
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 273
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 275
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 275
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
19.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
20 Applications 276
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.2 Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.3 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Bibliography 283
Index 298
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
Introduction: Johannes Roith, Eric Morris, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao.
Numerical: Meire Fortunato.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Partition function: Sam Bowman.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).